RESEARCH OVERVIEW


The  research  vehicle  for  this  contract  is  the  largest  possible  computer  that  could  be  conceived  for  the 
mid  to  late  1990s.  The  technical  challenges  of  such  a  machine  serve  as  the  guiding  stimulus  for  the  research 
carried  out  and  reported  here. 


We imagine this machine to occupy a 14-story building, to cost upwards of $1,000,000,000, and to be so
colossal  that  the  nation  could  only  afford  one  or  two  of  them.  The  available  chip  technology  and  machine  size 
are  consistent  with  a  million  billion  FLOPS  (that’s  10  to  the  15th)  and  a  million  billion  Bytes  of  memory.  It  will 
dissipate  50  megawatts  of  power  using  CMOS  technology.  Communication  across  the  machine  will  be  much 
slower  than  computation  at  a  node.  The  architecture,  software,  interconnect  technology,  packaging,  and 
operating system are unknown.


This investigation deals with hardware technology, software techniques, programming algorithms,
communications, processing elements, and applications. The study will determine the plausibility (not
feasibility) of such a machine. Progress in these various areas is highlighted in the individual sections below.

CIRCUITS

Our work over the past six months has been driven by two central themes: (1) the realization that with
proper  architectural  support,  we  need  not  arbitrarily  choose  a  single  programming  model  to  support  on  a 
parallel  processor,  and  (2)  an  investigation  of  the  deep  relationship  between  constructing  reliable,  distributed, 
parallel  memory  systems  and  more  conventional  software  database  technology. 

The  support  of  multiple  programming  models  (shared  memory,  dataflow,  functional,  message  passing, 
systolic,  data  level  parallelism  etc.)  on  a  single  architecture  appears  now  to  be  not  only  possible,  but  inevitable. 
With  modest  adjustments  to  processor,  cache  controllers,  and  interconnection  technology,  the  requirement  that 
each  new  parallel  programming  model  demands  a  new  system  level  design  appears  to  be  gone.  In  the 
processor,  the  main  requirements  are  fast  context  switch  and  fast  message  dispatch  and  composition.  In  the 
cache  controller,  the  requirement  is  for  support  of  one  of  the  newer  message  oriented  cache  coherency 
protocols.  In  the  switch,  the  requirement  is  for  extremely  low  latency,  reliable  communications. 

Our  approach  has  been  to  attack  the  interconnection  issues  first,  on  the  assumption  that  they  were  least 
likely  to  be  addressed  carefully  by  others,  and  that  they  were  (or  could  be  made)  simpler  and  easier  to  get  right 
than  the  more  complex  designs  for  cache  and  processor.  The  Transit  communication  network  is  the  outgrowth 
of  this  work. 

Transit* is a small scale prototype network designed with fault tolerance and low latency as key design
goals. Henry Minsky and Andre DeHon are working with me in the development of a custom VLSI component
for this switch. The goal is to provide a 100 megabyte/second channel with 40 ns latency between pairs of up to
256 processor ports in a reliable, fault tolerant design.

Packaging  is  key,  and  we  are  developing  a  liquid  cooled  three-dimensional  package,  based  on  the  button 
board  concept,  which  provides  approximately  equal  wiring  density  in  all  three  axes.  The  goal  is  to  package  an 
entire  large  scale  multiprocessor  into  a  solid  cubical  block  of  printed  circuit  board  material  approximately  18" 
on  a  side. 

Innovative techniques are being used to transmit data between chips,† with the driver modified by replacing the
voltage controlled output switches with a binary weighted D/A resistive network.

The  switch  design  is  extremely  simple,  and  involves  no  buffering,  queuing,  or  combining,  opting  instead 
to  concentrate  on  maximum  speed  of  transmission.  Fault  tolerance  is  achieved  with  a  combination  of  random 
route choice between equivalent paths and Ethernet-style positive acknowledgement plus retry.

High  expected  success  routing  rates  are  achieved  by  the  use  of  a  2x  dilated  omega  network  topology, 
which  as  Leighton  and  Koch  show,  has  dramatically  improved  statistics  over  the  conventional  omega  network. 
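
As a rough illustration of the routing discipline described above (not the Transit design itself), the sketch below models a dilated multistage network in butterfly form: every logical link carries `dilation` equivalent wires, contending messages are given wires at random, losers are dropped without buffering, and the sender retries whatever was not acknowledged. The function names, the butterfly wiring, and the round-based retry are assumptions made for the example, and each source is assumed to send one message to a distinct destination.

```python
import random
from collections import defaultdict

def route_round(messages, stages, dilation, rng):
    """One routing attempt. Returns the set of (src, dst) pairs that got through."""
    position = {m: m[0] for m in messages}   # row currently holding each message
    alive = list(messages)
    for s in range(stages):
        bit = 1 << (stages - 1 - s)           # destination bit fixed at this stage
        contenders = defaultdict(list)
        for m in alive:
            row = position[m]
            nxt = (row & ~bit) | (m[1] & bit)
            contenders[(row, nxt)].append(m)  # messages wanting this logical link
        alive = []
        for (_, nxt), group in contenders.items():
            rng.shuffle(group)                # random choice among equivalent wires
            for m in group[:dilation]:        # only `dilation` wires per link
                position[m] = nxt
                alive.append(m)
            # Losers are dropped here: the switch has no buffering or queuing.
    return set(alive)

def deliver_all(messages, stages, dilation, seed=0):
    """Retry unacknowledged messages until all arrive; returns the number of rounds."""
    rng = random.Random(seed)
    pending = set(messages)
    rounds = 0
    while pending:
        pending -= route_round(sorted(pending), stages, dilation, rng)
        rounds += 1
    return rounds
```

Calling `deliver_all(perm, 4, 1)` and `deliver_all(perm, 4, 2)` on the same 16-row permutation gives a feel for how dilation changes the number of retry rounds; the figures produced by this toy model say nothing about the statistics of the real network.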

Despite  our  detailed  attention  to  the  routing  network  design,  preliminary  thought  is  also  being  given  to 
other  portions  of  the  design. 

In  the  processor  area,  we  are  looking  at  techniques  for  speeding  process  switching,  including  the  VLSI 
design  of  large  multi-ported  register  files  with  built  in  backup  copies,  allowing  single  cycle  register  file  switching. 


* T. Knight, “Technologies for Low Latency Interconnection Networks,” Symposium on Parallel Algorithms
and Architectures, June, 1989.

† T. Knight, “A Self-Terminating Low Voltage Swing CMOS Driver,” Journal of Solid State Circuits, April,
1988.

In the cache area, we are investigating a variety of message based protocols cooperatively with Anant
Agarwal’s group. In addition, Neil Lackritz is working with me in attempting to understand the impact that a
knowledge of hardware datatypes and their properties has on cache performance. Our hope is that we may
gain  cache  performance  over  more  conventional  designs  by  relying  on  a  combination  of  compiler  directives  and 
run  time  types  of  data  to  control  the  cache  refill  algorithms. 

More  fundamentally,  we  are  beginning  in  earnest  to  examine  the  relationship  between  conventional 
software  oriented  database  technology  and  the  problem  of  maintaining  consistent,  replicated,  distributed  copies 
of  main  memory. 

Alan  Bawden,  for  example,  is  continuing  his  work  on  the  utility  and  importance  of  side  effects.  His  most 
important  results  to  date  include  the  first  mathematical  description  of  what  a  side  effect  is;  he  is  now 
implementing  a  model  of  computation  in  which  the  costs  (and  benefits)  of  side  effects  are  explicitly  visible  to 
the  programmer. 

Patrick Sobalvarro is extending our earlier work on dynamically checked side effects* by collecting
traces  of  programs  and  implementing  the  coherency/concurrency  techniques. 

Bryan  Butler,  in  collaboration  with  Draper  Labs,  is  designing  a  novel  fault  tolerant  main  memory 
structure  based  on  coding  techniques,  which  is  suitable  for  use  in  four  way  Byzantine  fault  tolerant 
architectures,  uses  half  of  the  memory  of  alternative  approaches,  and  dramatically  (3  orders  of  magnitude) 
improves  the  predicted  mean  time  to  failure  of  the  memory  system. 


* T. Knight, “An Architecture for Mostly Functional Languages,” ACM Lisp and Functional Programming
Conference, August, 1986.

PROCESSING ELEMENTS

Our  goal  is  to  study  the  issues  involved  with  and  to  develop  technology  for  constructing  building-sized 
multicomputers  with  1997  technology.  These  machines  will  be  of  such  a  scale  that  their  design  will  have  to 
make  the  most  efficient  use  of  wires  and  energy. 

The  Named  State  Processor: 

A  computer  the  size  of  the  ARC  will  require  the  ability  to  switch  tasks  rapidly  to  hide  transmission 
latency  without  sacrificing  single-thread  performance.  Peter  Nuth  and  Bill  Dally  are  working  on  an  architecture 
for  a  named  state  processor  that  achieves  this  goal  by  explicitly  binding  names  to  all  processor  registers  and 
interleaving  tasks  on  a  microcycle  basis.  This  mechanism  combines  the  advantages  of  multi-threading  and 
multiple register sets for implementing fast context switches and procedure calls. It also provides a general
synchronization mechanism.

Naming  state  permits  process  switches  to  be  performed  in  essentially  zero  time  as  no  registers  need  be 
saved  or  restored.  A  process  switch  is  performed  by  simply  issuing  an  instruction  fetch  with  a  different  process 
ID  field.  Unlike  conventional  multithreaded  processors,  this  approach  also  permits  the  instructions  of  a  single 
process  to  be  pipelined,  executing  one  per  cycle,  achieving  good  single  thread  performance. 

Naming  the  processor  state  permits  several  processes  to  have  instructions  in  the  pipeline  simultaneously. 
Pipeline  bubbles  due  to  data  dependencies,  memory  latency,  or  interprocess  communication  are  filled  by 
advancing  instructions  from  other  processes. 

The  named  state  processor  performs  all  synchronization  through  the  use  of  presence  tags  on  its  state. 
Synchronization  on  register  dependencies,  memory  references,  and  communication  actions  all  use  this  single 
mechanism. 
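
The interplay of named state, presence tags, and bubble filling can be made concrete with a toy model. The sketch below is only an assumption-laden illustration, not the named state processor: registers are named by (process ID, register), an instruction issues only when the presence tags of its sources are set, at most one instruction issues per cycle, and processes are tried in a fixed priority order. `Instr`, `run`, and the latency figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dst: str        # register written by the instruction
    srcs: tuple     # registers read by the instruction
    latency: int    # cycles until dst's presence tag is set

def run(threads):
    """threads: dict pid -> list[Instr]. Returns the cycle at which the last result is present."""
    ready_at = {}                       # (pid, reg) -> cycle when its presence tag is set
    pc = {pid: 0 for pid in threads}
    cycle = 0
    finish = 0
    while any(pc[p] < len(threads[p]) for p in threads):
        for pid, prog in threads.items():
            if pc[pid] == len(prog):
                continue                # this process has no instructions left
            ins = prog[pc[pid]]
            # Registers never written are treated as present from cycle 0.
            if all(ready_at.get((pid, r), 0) <= cycle for r in ins.srcs):
                ready_at[(pid, ins.dst)] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                pc[pid] += 1
                break                   # one issue slot per cycle
        cycle += 1                      # if nothing issued, this cycle was a bubble
    return finish
```

With a three-cycle load followed by a dependent add, one process alone finishes at cycle 4, while two such processes interleaved finish at cycle 5 rather than 8, because the second process's load fills a bubble left by the first.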

Since  our  last  report  we  have  refined  the  named  state  processor  architecture  and  defined  its  interface  to 
a multicomputer network. We are currently studying instruction scheduling policies (deciding which processes'
instructions get advanced when) and context cache management policies (deciding which processes' state
remains in active storage). A simulator for the processor is under construction.

Concurrent  data  abstractions: 

Andrew  Chien  and  Bill  Dally  are  developing  data  abstraction  tools  that  support  the  development  of 
programs  for  large  scale  multicomputers.  A  language,  concurrent  data  abstractions,  is  being  defined  that 
facilitates  the  specification  of  aggregates  of  cooperating  objects.  Concurrent  data  abstractions  permit  the 
relationships  between  objects  to  be  defined  textually  rather  than  requiring  that  the  objects  connect  up  a  pointer 
structure  at  run-time  as  is  typically  done.  Common  structures  (e.g.,  combining  trees)  can  be  defined  once  and 
reused  as  required.  The  language  also  permits  nesting  of  object  aggregates  and  specialization  of  objects  within 
the  aggregate. 

Database  applications: 

John  Keen  and  Bill  Dally  are  investigating  the  application  of  an  ARC  sized  computer  to  database 
applications. The issues involved include data partitioning, methods for ensuring stability and persistence,
concurrency  control,  and  efficient  algorithms  for  search  and  update. 


Fast  Translation  Method: 

Bill Dally has developed a one-step translation method that implements paging on top of segmentation.
This  method  translates  a  virtual  address  into  a  physical  address,  performing  both  the  segmentation  and  paging 
translations,  with  a  single  TLB  read  and  a  short  add.  Previous  methods  performed  this  translation  in  two  steps 
and  required  two  TLB  reads  and  a  long  add.  Using  the  fast  method,  the  fine-grain  protection  and  relocation  of 
segmentation  combined  with  paging  can  be  provided  with  delay  and  complexity  comparable  to  paging-only 
systems.  This  method  allows  small  segments,  particularly  important  in  object-oriented  programming  systems, 
to  be  managed  efficiently. 
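
For contrast, the conventional two-step translation that the one-step method improves on can be sketched as follows. This shows only the textbook baseline (a segment lookup, a full-width add, then a page lookup); the report does not spell out the one-step scheme, so it is not reproduced here. The table layouts, the 4 KiB page size, and the function name are assumptions.

```python
PAGE_BITS = 12  # assumed 4 KiB pages

def translate_two_step(seg, offset, segment_table, page_table):
    """Baseline segmentation-then-paging translation with two table lookups."""
    base, limit = segment_table[seg]           # lookup 1: segment base and limit
    if offset >= limit:
        raise MemoryError("segment limit violation")
    linear = base + offset                     # full-width ("long") add
    vpn = linear >> PAGE_BITS                  # virtual page number
    page_off = linear & ((1 << PAGE_BITS) - 1)
    frame = page_table[vpn]                    # lookup 2: page frame
    return (frame << PAGE_BITS) | page_off
```

A small segment need not be page aligned: with base 0x2F00, limit 0x200, and offset 0x180, the linear address 0x3080 falls in virtual page 3, and only the second lookup determines the physical frame.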

Floating  point  optimization  technique: 

Bill  Dally  and  Lucien  Van  Elsen  have  developed  a  technique,  micro-optimization,  for  reducing  the 
operation count and time required to perform numerical calculations. The method involves breaking
floating-point operations into their constituent integer micro-operations and then optimizing and scheduling the resulting
integer  code.  The  method  has  been  tested  using  a  prototype  expression  compiler.  We  are  now  looking  at 
extending  the  method  to  permit  a  compiler  to  perform  automatic  scaling  of  numbers.  Where  it  is  possible,  this 
optimization  would  convert  floating  point  expressions  into  integer  expressions. 
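
One plausible kind of saving can be illustrated with a chain of multiplications. The report does not list the micro-operations used, so the decomposition below (unpack into integer mantissa and exponent, integer multiply, integer add, pack) and the operation counts are assumptions chosen for the example. Python's unbounded integers also hide the renormalization a real implementation would need.

```python
import math

def unpack(x):
    """Split a float into integers so that x == mant * 2**exp."""
    frac, exp = math.frexp(x)
    return int(frac * (1 << 53)), exp - 53

def pack(mant, exp):
    """Rebuild a float from an integer mantissa and exponent."""
    return math.ldexp(mant, exp)

def product_naive(values):
    """Pack and unpack around every multiply, as separate FP operations would."""
    acc, micro_ops = values[0], 0
    for v in values[1:]:
        (ma, ea), (mb, eb) = unpack(acc), unpack(v)
        acc = pack(ma * mb, ea + eb)
        micro_ops += 5   # 2 unpacks, integer multiply, integer add, 1 pack
    return acc, micro_ops

def product_fused(values):
    """Unpack each operand once, stay in integer form, pack once at the end."""
    mant, exp = unpack(values[0])
    micro_ops = 1        # the initial unpack
    for v in values[1:]:
        mb, eb = unpack(v)
        mant, exp = mant * mb, exp + eb
        micro_ops += 3   # 1 unpack, integer multiply, integer add
    return pack(mant, exp), micro_ops + 1   # plus the final pack
```

For the three-operand product 1.5 x 2.25 x 4.0 both versions return 13.5, with 10 counted micro-operations in the naive form and 8 in the fused form.
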

COMMUNICATIONS TOPOLOGY AND ROUTING ALGORITHMS

Charles  Leiserson  returned  from  a  leave  of  absence  at  Thinking  Machines  Corporation  January  1, 1989. 
He  was  an  invited  speaker  at  the  25th  Anniversary  Symposium  for  Project  MAC  at  MIT,  and  at  the  Decennial 
Caltech  VLSI  Conference.  He  also  served  on  the  first  program  committee  for  the  ACM  Symposium  on  Parallel 
Algorithms  and  Architectures. 

Charles  Leiserson  and  Tom  Cormen  have  been  concentrating  on  finishing  the  textbook  Introduction  to 
Algorithms  with  Ronald  Rivest.  Besides  offering  a  combined  engineering  and  theoretical  approach  to  computer 
algorithms,  the  book  has  several  chapters  devoted  to  parallel  computing  --  a  novelty  in  the  area.  The  book  will 
be  published  jointly  by  MIT  Press  and  McGraw-Hill  by  the  end  of  1989. 

Shlomo  Kipnis  has  been  investigating  parallel  architectures  and  interconnection  networks.  He  is  trying 
to  further  explore  the  power  of  bussed  interconnection  schemes  to  route  permutations.  In  addition,  he  is 
investigating  the  mesh  and  the  hypercube  interconnection  schemes,  and  is  looking  into  the  problem  of 
embedding one network in another network. Recently, he has also studied the problem of range queries in
computational geometry. Answering range queries is a fundamental problem in computational geometry with applications
in  computer  graphics  and  database  retrieval  systems. 

Marios  Papaefthymiou  joined  the  group  in  December  1988.  He  has  been  working  with  Charles  Leiserson 
on algorithms for optimizing synchronous circuitry. Recently, he discovered a simple $O(E)$ algorithm for
pipelining  combinational  circuitry  to  achieve  a  given  clock  period. 

Jeff  Fried  is  currently  working  on  several  problems  related  to  the  architecture  and  control  algorithms 
needed for high performance communication networks. This work includes a study of the impact of synchrony
on  the  performance  of  distributed  algorithms,  and  design  studies  for  a  VLSI  packet  router  chip.  Fried 
completed  his  Master’s  thesis  this  semester.  His  thesis  work  involved  the  design  of  VLSI  processors  for  use 
within  the  interconnection  networks  found  in  telecommunications,  distributed  computing,  and  parallel 
processing.  He  also  completed  a  study  of  some  of  the  modularity  tradeoffs  found  in  sparse  circuit-switched 
interconnection  networks.  In  the  next  year,  Fried  plans  to  continue  his  study  of  distributed  algorithms  and 
architectures  to  support  them.  He  will  also  be  considering  a  number  of  other  problems  relating  to  the  design  of 
switching  nodes  for  use  in  broadband  networks. 

Cynthia  Phillips  continued  her  investigation  of  parallel  graph  contraction  algorithms  which  lead  to 
simple  algorithms  for  connected  components,  biconnectivity,  and  spanning  trees  of  graphs.  She  developed  a 
simple contraction algorithm for general n-node graphs which runs in $O(\lg^2 n)$ time using $O(n)$
processors on an EREW PRAM. This algorithm is used in a contraction algorithm for bounded-degree graphs
which runs in $O(\lg n + \lg^2 g)$ time where g is the maximum genus of any connected component. Also,
with  Stavros  Zenios  of  the  Wharton  School  at  U.  Pennsylvania,  she  began  an  investigation  of  parallel 
implementations  of  algorithms  for  network  optimization  problems.  In  particular,  they  investigated  the  behavior 
of  known  algorithms,  embellished  with  heuristics,  for  the  assignment  problem  and  nonlinear  network  flow. 

In  the  past  six  months,  James  K.  Park  has  been  collaborating  with  Alok  Aggarwal  and  Dina  Kravets  on  a 
number  of  problems  relating  to  totally  monotone  arrays.  Totally  monotone  arrays  arise  naturally  in  a  wide 
variety  of  fields,  including  computational  geometry,  dynamic  programming,  string  matching,  and  VLSI  river 
routing.  Park’s  work  with  Aggarwal  centers  on  the  problem  of  finding  maximum  entries  in  totally  monotone 
arrays  and  applications  of  efficient  sequential  and  parallel  algorithms  for  this  problem.  Park’s  work  with 
Kravets  considers  other  comparison  problems  (such  as  sorting  and  computing  order  statistics)  in  the  context  of 
totally monotone arrays and applications of efficient solutions to these problems.
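
To see why monotonicity helps, consider the row-maxima problem. The sketch below is a simple divide-and-conquer routine, not one of the algorithms developed in this work. It assumes the convention under which the column of the leftmost maximum never moves left as the row index increases, so the maximum found in the middle row splits the column range for the rows above and below. The function name is hypothetical.

```python
def row_maxima(a):
    """Return, for each row of a monotone array, the column of its leftmost maximum."""
    rows, cols = len(a), len(a[0])
    best = [0] * rows

    def solve(r_lo, r_hi, c_lo, c_hi):
        if r_lo > r_hi:
            return
        mid = (r_lo + r_hi) // 2
        # max() returns the first maximal column it meets, i.e. the leftmost one.
        c = max(range(c_lo, c_hi + 1), key=lambda j: a[mid][j])
        best[mid] = c
        solve(r_lo, mid - 1, c_lo, c)   # rows above: maxima lie at or left of c
        solve(mid + 1, r_hi, c, c_hi)   # rows below: maxima lie at or right of c

    solve(0, rows - 1, 0, cols - 1)
    return best
```

Each level of the recursion scans every column only a bounded number of times, so far fewer entries are examined than in a full scan of the array.
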

Alexander Ishii has completed his master's thesis*, which describes his model for VLSI timing
analysis.  The  model  maps  continuous  data-domains,  such  as  voltage,  into  discrete,  or  digital,  data  domains, 
while  retaining  a  continuous  notion  of  time.  The  majority  of  the  thesis  concentrates  on  developing  lemmas  and 
theorems that can serve as a set of “axioms” when analyzing algorithms based on the model. Key axioms
include  the  fact  that  circuits  in  our  model  generate  only  well  defined  digital  signals,  and  the  fact  that 
components  in  our  model  support  and  accurately  handle  the  “undefined”  values  that  electrical  signals  must  take 
on  when  they  make  a  transition  between  valid  logic  levels.  In  order  to  facilitate  proofs  for  circuit  properties,  the 
class  of  computational  predicates  is  defined.  A  circuit  property  can  be  proved  by  simply  casting  the  property  as 
a  computational  predicate. 

Ishii  has  also  been  working  with  Bruce  Maggs  on  a  new  VLSI  design  for  a  high-speed  multi-port  register 
file.  Design  goals  include  short  cycle-time  and  single-cycle  register  window  context  changes.  This  research 
began  as  an  advanced  VLSI  class  project,  under  the  supervision  of  Thomas  Knight  of  the  MIT  Artificial 
Intelligence  Laboratory. 

Ishii  has  also  been  working  with  Ronald  Greenberg  and  Alberto  Sangiovanni-Vincentelli  of  Berkeley  on 
a multi-layer channel router for VLSI circuits, called MulCh†. While based on the Chameleon system
developed  at  Berkeley,  MulCh  incorporates  the  additional  feature  that  nets  may  be  routed  entirely  on  a  single 
interconnect  layer  (Chameleon  requires  the  vertical  and  horizontal  sections  of  a  net  be  routed  on  different 
interconnect  layers).  When  used  on  sample  problems,  MulCh  shows  significant  improvements  over 
Chameleon  in  area,  total  wire  length,  and  via  count. 

Besides  his  work  with  Ishii  and  Sangiovanni-Vincentelli,  Ronald  Greenberg  has  been  continuing  work  in 
two  areas:  channel  routing  for  VLSI  chips  and  area-universal  networks  for  general-purpose  parallel 
computation.  With  Miller  Maley  of  Princeton,  he  has  been  developing  techniques  for  efficiently  determining 
minimum  area  routings  for  single-layer  channels  and  switchboxes,  which  can  be  usefully  incorporated  into 
previous work on multi-layer channel routing. In the area of general-purpose parallel computation he has
developed  stronger  and  more  general  results  about  the  ability  of  a  machine  built  in  a  fixed  amount  of  area  to 
simulate  other  parallel  machines. 

Bruce  Maggs  has  been  working  with  Richard  Koch,  Tom  Leighton,  Satish  Rao,  and  Arnold  Rosenberg. 
They  have  been  studying  the  ability  of  a  host  network  to  emulate  a  possibly  larger  guest  network.  An  emulation 
is called “work-preserving” if the work (processor-time product) performed by the host is at most a constant
factor  larger  than  the  work  performed  by  the  guest.  A  work-preserving  emulation  is  important  because  it 
achieves  optimal  speedup  over  a  sequential  emulation  of  the  guest.  Many  work-preserving  emulations  for 
particular  networks  have  been  discovered.  For  example,  the  N-node  butterfly  can  emulate  an  N  log  N  node 
shuffle-exchange  graph  and  vice  versa.  On  the  other  hand,  a  work-preserving  emulation  may  not  be  possible 
unless the guest graph is much larger than the host. For example, a linear array cannot perform a
work-preserving emulation of a butterfly unless the butterfly is exponentially larger than the array. Worse yet, a
work-preserving emulation may not exist. A butterfly cannot perform a work-preserving emulation of an expander
graph.  These  positive  and  negative  results  provide  a  basis  for  comparing  the  relative  power  of  different 
networks. 
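
The definition can be restated symbolically. In the sketch below, the symbols are introduced only for illustration: $N_G$ and $N_H$ are the numbers of guest and host processors, $T$ is the number of guest steps, and $S$ is the slowdown, so the host takes $S\,T$ steps.

```latex
W_G = N_G \, T, \qquad W_H = N_H \, (S \, T), \qquad
W_H = O(W_G) \iff S = O\!\left(\frac{N_G}{N_H}\right)
```

For the butterfly example above, $N_H = N$ and $N_G = N \log N$, so a work-preserving emulation corresponds to a slowdown of $O(\log N)$.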


*  Alexander  T.  Ishii,  A  Digital  Model  for  Level-Clocked  Circuitry,  Master’s  thesis.  Department  of  Electrical 
Engineering  and  Computer  Science,  MIT,  August  1988. 

† Ronald I. Greenberg, Alexander T. Ishii, and Alberto L. Sangiovanni-Vincentelli. MulCh: A multi-layer
channel router using one, two, and three layer partitions. In IEEE International Conference on
Computer-Aided Design (ICCAD-88), pages 88-91, IEEE Computer Society Press, 1988.

SYSTEMS SOFTWARE

Our  research  over  the  past  year  has  addressed  the  design  of  directory  systems,  interconnection  networks, 
and  processing  elements  for  large-scale  multiprocessors  with  coherent  caches.  Because  the  building-sized 
American  Resource  Computer  must  exploit  locality  to  scale  far  beyond  the  limits  of  current  multiprocessors,  a 
major  part  of  our  effort  was  devoted  to  the  issue  of  locality.  Briefly,  we  have  developed  a  model  of  memory 
referencing locality to analyze address streams of existing parallel applications, modified our Mul-T compiler
and run-time system to derive statistics on locality patterns in multiprocessor applications, and derived new
performance evaluation models that capture the effects of locality. We are also investigating efficient
performance  analysis  and  data  collection  techniques  for  large-scale  multiprocessors,  and  task  scheduling 
strategies  to  enhance  locality. 

Continuing our efforts in parallel trace data collection, we now have a tracer called TMul-T that can
generate  traces  for  parallel  symbolic  applications.  An  implementation  of  TMul-T  on  the  Encore  Multimax  runs 
with  a  slowdown  of  less  than  30x  on  a  single  processor.  A  port  of  TMul-T  to  the  DEC  MICROVAX  is  also 
complete.  We  are  now  porting  TMul-T  to  the  MIPS  processor.  We  have  gathered  several  large  traces  of 
symbolic  applications  written  in  Mul-T  including  MODSIM  —  a  functional  simulator,  BOYER  —  a  theorem 
prover,  and  several  smaller  applications.  We  have  continued  tracing  parallel  C  applications  under  the  MACH 
operating  system  using  the  VAX  T  bit  technique.  In  a  joint  effort  with  IBM,  we  have  derived  large  parallel 
FORTRAN  traces  using  a  postmortem  scheduling  method  that  can  incorporate  multiple  synchronization 
models.  FORTRAN  traces  include  SIMPLE,  WEATHER,  and  FFT.  We  are  using  these  traces  in  a  wide 
variety  of  studies  ourselves,  and  we  also  plan  to  distribute  our  trace  data  to  the  research  community  and  to 
industry.  A  slight  modification  to  our  parallel  TMul-T  tracer  has  also  enabled  the  emulation  of  large-scale 
multiprocessors. 

We  now  have  running  simulators  for  cache/directory  systems  and  interconnection  networks.  These  two 
simulators  can  be  plugged  back  to  back  to  provide  the  system  backend  to  a  processor  emulator.  Currently  we 
can  use  either  our  TMul-T  system  or  the  FORTRAN  post-mortem  scheduler  as  the  processor  emulator. 

We  have  a  new  model  representing  memory  referencing  locality  in  multiprocessor  systems.  This  locality 
model, suitable for multiprocessor cache evaluation, is derived by viewing memory references as streams of
processor  identifiers  directed  at  specific  cache/memory  blocks.  This  viewpoint  differs  from  the  traditional 
uniprocessor  approach  that  uses  streams  of  addresses  to  different  blocks  emanating  from  specific  processors. 
Our  view  is  based  on  the  intuition  that  cache  coherence  traffic  in  multiprocessors  is  largely  determined  by  the 
number  of  processors  accessing  a  location,  the  frequency  with  which  they  access  the  location,  and  the  sequence 
in  which  their  accesses  occur.  The  specific  locations  accessed  by  each  processor,  the  time  order  of  access  to 
different  locations,  and  the  size  of  the  working  set  play  a  smaller  role  in  determining  the  cache  coherence 
traffic,  although  they  still  influence  intrinsic  cache  performance.  We  have  some  initial  results  that  show  that 
these processor references directed to a memory block display the LRU stack property. If we succeed in
showing  this  is  indeed  true  across  a  large  set  of  parallel  applications,  then  the  abundant  literature  on  LRU  stack 
evaluation  for  single  processors  can  be  straightforwardly  used  in  evaluation  of  multiprocessor  performance. 
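
The per-block view of a trace can be sketched in a few lines. The code below is a minimal illustration, not our analysis tool: it keeps one LRU stack of processor identifiers for every memory block and records the stack depth at which each reference hits. The trace format, a sequence of (processor, block) pairs, and the function name are assumptions.

```python
from collections import Counter, defaultdict

def stack_distance_histogram(trace):
    """trace: iterable of (processor_id, block). Returns a Counter of LRU stack depths."""
    stacks = defaultdict(list)    # block -> processor ids, most recent first
    hist = Counter()
    for proc, block in trace:
        stack = stacks[block]
        if proc in stack:
            depth = stack.index(proc) + 1   # 1 = same processor as the last reference
            stack.remove(proc)
        else:
            depth = None                    # first reference to this block by proc
        hist[depth] += 1
        stack.insert(0, proc)               # proc is now the most recent referencer
    return hist
```

A histogram concentrated at small depths would indicate that a block tends to be referenced again by the processors that touched it most recently, which is the behavior the LRU stack property describes.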

We  are  investigating  novel  VLSI  processor  architectures  for  large-scale  multiprocessor  systems.  A 
processor  called  APRIL  is  being  designed.  (This  processor  borrows  heavily  from  the  MARCH  processor 
design  of  Bert  Halstead  at  the  MIT  Laboratory  for  Computer  Science,  and  the  Stanford  MIPS-X  processor 
design,  but  differs  substantially  from  the  two.  Unlike  MARCH,  APRIL  has  hardware  interlocks  in  the  pipeline, 
does  not  interleave  process  threads,  and  uses  software  thread  scheduling.  Unlike  MIPS-X,  it  allows  multiple 
hardware  contexts,  and  has  hardware  support  for  synchronization  and  futures.)  The  chief  issues  being 
addressed  in  this  design  are  rapid  context  switching,  fast  trap  handling,  high  single  thread  performance, 
hardware  support  for  synchronization  and  futures,  and  register  file  organization.  An  important  observation  of 
our study so far has been identifying the specific hardware-software tradeoffs one must make for achieving
overall  high  system  performance.  Some  examples  include  hardware  versus  software  for  fine-grain  task 
management  and  scheduling  in  a  multithreaded  processor,  and  hardware  provided  synchronization  primitives 
such  as  fetch-and-op  versus  software  synthesized  primitives  from  basic  interlocked  load/store  instructions.  We 
currently  have  a  preliminary  instruction-set  specification.  A  Mul-T  compiler  for  this  processor  and  a  simulator 
are  also  being  written. 

The  design  of  a  cache-directory  and  network  communications  controller  to  be  used  in  a  large-scale 
multiprocessor is in progress. The chief issues being addressed are: the programmability and the
implementation efficiency of various shared-memory programming paradigms, such as strong serialization
versus weak ordering; support for full-empty bits in the cache directory controller; and tradeoffs in controller
design to support context switching, such as re-issuing instructions versus pipeline freezing.

We  have  analyzed  interconnection  network  architectures  that  can  best  exploit  the  lower  average  traffic 
intensity  of  cache-coherent  systems.  Evaluations  with  packet  switched  and  circuit  switched  networks,  assuming 
similar  speeds  for  the  switch  nodes,  show  that  circuit  switching  can  be  superior  to  packet  switching  in  the 
medium  scale  (256-1000  processors).  Our  simulations  with  the  parallel  FORTRAN  traces  also  indicate  that 
directories  yield  better  processor  utilization  than  a  scheme  that  does  not  cache  shared  data. 

We  investigated  the  scalability  of  cache  coherence  schemes.  We  showed  that  these  schemes  can  scale  at 
least  through  64  processors  by  simulations  against  parallel  FORTRAN  traces.  (Memory  limitations  precluded 
our  analyzing  larger  systems.)  We  observed  that  synchronization  references  are  the  chief  impediment  to 
scalability.  We  are  investigating  new  scalable  synchronization  methods  that  do  not  incur  excessive  hardware 
cost. 


We  have  developed  a  new  technique  for  efficient  synchronization  called  adaptive  backoff 
synchronization.  A  purely  software  approach,  adaptive  backoff  synchronization  helps  reduce  network 
contention due to fine-grain synchronization accesses across a network. Our technique can help reduce hot-spot
contention  in  large-scale  networks  without  resorting  to  hardware-intensive  solutions  like  combining  networks. 
We  are  also  studying  software  combining  to  determine  the  extent  to  which  a  directory  cache  coherence  scheme 
can efficiently support fine-grain barrier synchronization.
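
The basic tradeoff behind software backoff can be shown with a deterministic toy model. The sketch below uses plain exponential backoff as a stand-in; it does not reproduce the adaptive policy itself, whose details are not given here. A processor polls a synchronization variable that is released at a known time, and the function counts how many polls cross the network. All names and parameters are hypothetical.

```python
def polls_until_release(start, release, base=1, factor=1, cap=64):
    """Poll from time `start` until a poll lands at or after `release`.

    Returns (number_of_polls, time_of_successful_poll). With factor=1 the
    processor polls at a fixed interval; with factor=2 the interval doubles
    after every failed poll, up to `cap`.
    """
    t, delay, polls = start, base, 0
    while True:
        polls += 1                  # each poll is one network access
        if t >= release:
            return polls, t
        t += delay                  # wait before trying again
        delay = min(delay * factor, cap)
```

For a variable released at time 100, polling every cycle costs 101 accesses and sees the release immediately, while doubling the delay costs 8 accesses and sees it at time 127. Choosing the delay well is what keeps the saved traffic from turning into added latency.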

Industry  collaborations: 

- Parallel FORTRAN applications, post-mortem scheduling,
  and address tracing with IBM T. J. Watson Research,
  Yorktown, NY. With Harold Stone, Kimming So, Scott Kirkpatrick.

- Affinity scheduling for enhancing multiprocessor memory
  referencing locality, models of multiprocessor referencing,
  software cache coherence, with DEC Systems Research Center,
  Palo Alto, CA. With Susan Owicki.

- ATUM-2 Multiprocessor data collection efforts in collaboration
  with DEC, Hudson, MA. With Dick Sites.

ALGORITHMS

Prof. Leighton is continuing his research on networks and algorithms for parallel computation. Recently
he has focused on the following specific problems: the development of fast packet routing algorithms for
commonly used fixed-connection networks such as the butterfly, array, and shuffle-exchange graph; the
development of algorithms to reconfigure networks such as the hypercube around faults; the development of
dynamic on-line algorithms for embedding computational structures such as trees in networks such as the
hypercube in a way that balances computational load and minimizes the induced communication load on
the network; the development of algorithms for emulating one kind of network on another in a way that
preserves the total amount of work (processors x time) that is done; and the development of efficient
approximation algorithms for a variety of layout-related problems such as graph bisection. The particular
advances that have been made in each of these areas are briefly summarized in what follows.

In the area of packet routing, Prof. Leighton and his coauthors have discovered the first store-and-forward
routing algorithm which can route n² packets in 2n - 2 steps on an n x n array with constant-size
queues at each node. They have also discovered a more practical randomized routing algorithm that is
guaranteed to have near-optimal performance for an array of any dimension (including a hypercube), the
butterfly, and the shuffle-exchange graph. The latter algorithm also works for many-one routing problems
with combining and performs well in heuristic simulations. The details of these and related results have been
published.*,†,‡
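The 2n - 2 step count equals the distance a packet must travel between opposite corners of an n x n array, so a store-and-forward scheme that moves a packet one hop per step cannot do better in the worst case. The following is a minimal sketch of that distance calculation only; it does not implement the routing algorithm, and the function names are assumptions introduced for illustration.

```python
def mesh_distance(src, dst):
    """Number of single-hop moves between two nodes of a 2-D array."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_diameter(n):
    """Largest distance between any two nodes of an n x n array (brute force)."""
    nodes = [(r, c) for r in range(n) for c in range(n)]
    return max(mesh_distance(a, b) for a in nodes for b in nodes)
```

For any n, `mesh_diameter(n)` agrees with `2 * n - 2`, the corner-to-corner distance.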

In  the  area  of  fault-tolerance,  Prof.  Leighton  and  his  coauthors  have  shown  that  a  hypercube  can  tolerate 
a  very  large  number  (a  constant  fraction)  of  randomly  located  faults  without  incurring  more  than  a  constant 
factor  loss  in  performance,  no  matter  how  large  the  hypercube  is.  They  have  also  discovered  simple  algorithms 
for  routing  around  faults  in  the  hypercube  that  are  guaranteed  to  perform  nearly  as  well  as  the  best  routing 
algorithms when no faults are present. The details of this work are described elsewhere.§

In the area of network embeddings and scheduling, Prof. Leighton and his coauthors have discovered
optimal  algorithms  for  embedding  dynamically  growing  and  shrinking  trees  in  a  hypercube  so  that  the 
processing  load  on  the  nodes  of  the  hypercube  is  balanced,  and  so  that  all  communication  links  are  local.  This 
work  has  application  to  the  problem  of  locally  scheduling  the  work  assigned  to  the  processors  of  a  hypercube  in 
a  dynamic  fashion  (i.e.,  as  one  computation  spawns  another,  the  algorithm  determines  the  processor  that  will 
handle  the  new  task).  They  have  also  discovered  optimal  algorithms  for  mapping  code  written  for  one 
architecture  onto  a  different  architecture  in  a  way  that  minimizes  the  total  amount  of  work  required  by  the 
simulating machine. These results are described elsewhere.||,**,††,‡‡

*  T. Leighton, B. Maggs, and S. Rao, “Universal Packet Routing Algorithms,” IEEE FOCS, pp. 256-269,
October, 1988.

†  T. Leighton, F. Makedon, and I. Tollis, “A 2n - 2 Step Algorithm for Routing in an n x n Array with
Constant-size Queues,” ACM SPAA, to appear, June, 1989.

‡  R. Koch, “Increasing the Size of a Network by a Constant Factor Can Increase the Performance by More
Than a Constant Factor,” IEEE FOCS, pp. 221-230, October, 1988.

§  J. Hastad, T. Leighton, and M. Newman, “Fast Computation Using Faulty Hypercubes,” ACM STOC, to
appear, May, 1989.

||  S. Bhatt, F. Chung, T. Leighton, and A. Rosenberg, “Universal Graphs For Bounded-degree Trees and Planar
Graphs,” SIAM J. Discrete Math., to appear, 1989.

**  R. Koch, T. Leighton, B. Maggs, S. Rao, and A. Rosenberg, “Work-preserving Emulations of
Fixed-connection Networks,” ACM STOC, to appear, May, 1989.

††  T. Leighton, M. Newman, and E. Schwabe, “Dynamic Embedding of Trees in Hypercubes with Constant
Dilation and Load,” ACM SPAA, to appear, June, 1989.

‡‡  S. Bhatt, F. Chung, T. Leighton, and A. Rosenberg, “Efficient Embedding of Trees in Hypercubes,”
submitted to SIAM J. Computing.

Lastly, in the area of approximation algorithms, Prof. Leighton and Satish Rao have discovered an
analogue of the max-flow min-cut theorem for multicommodity flow problems that can be used to find the first
good approximation algorithms for a wide variety of NP-hard combinatorial optimization problems such as
graph bisection and minimum feedback arc set. This work, and some recent heuristic analysis of some related
algorithms, is described elsewhere.*,†

*  T.  Leighton  and  S.  Rao,  “An  Approximate  Max-flow  Min-cut  Theorem  for  Uniform  Multicommodity  Flow 
Problems  with  Applications  to  Approximation  Algorithms,”  IEEE  FOCS,  pp.  422-431,  October,  1988. 

†  T. Bui, C. Heigham, C. Jones, and T. Leighton, “Improving the Performance of the Kernighan-Lin and
Simulated Annealing Graph Bisection Algorithms,” DAC, to appear, June, 1989.
APPLICATIONS

In  the  area  of  applications,  over  the  past  half  year  our  group  has  been  working  on  several  different  types 
of  numerical  algorithms,  both  to  aid  in  the  design  of  an  ARC,  as  well  as  to  uncover  methods  that  could 
effectively  exploit  an  ARC.  In  particular,  our  work  has  focused  on  capacitance  extraction,  parallel  circuit 
simulation,  specialized  circuit  simulation  algorithms  for  clocked  analog  circuits  and  analog  signal  processing 
circuits  for  early  vision,  simulation  of  small  geometry  devices,  and  mixed  circuit/device  simulation. 

A fast algorithm for computing the capacitance of a complicated 3-D geometry of ideal conductors in a
uniform dielectric has been developed.* The method is an acceleration of the standard integral equation
approach for multiconductor capacitance extraction. Integral equation methods cannot be applied easily to
large problems because they lead to dense matrices, which are typically solved with some form of Gaussian
elimination. This implies the computation grows like n³, where n is the number of tiles needed to accurately
discretize the conductor surface charges. We have developed a preconditioned conjugate-gradient iterative
algorithm with a multipole approximation to compute the iterates. This reduces the complexity of the
multiconductor capacitance calculation so that it grows as n x m, where m is the number of conductors.
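The key structural point is that a conjugate-gradient iteration touches the matrix only through matrix-vector products, so the dense product can be swapped for a faster approximate one. The sketch below illustrates this with a plain, unpreconditioned conjugate-gradient loop that takes the product as a callback. It is not the algorithm of the cited paper: the preconditioner and the multipole approximation are omitted, a dense product stands in for the fast one, and all names are assumptions made for illustration.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the initial guess x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # converged: residual norm is small enough
            break
        Ap = matvec(p)               # the only place the matrix is used
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def dense_matvec(A):
    """Stand-in product: an exact dense multiply, costing order n^2 per call."""
    def apply(v):
        return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]
    return apply
```

Replacing `dense_matvec` with a cheaper approximate product would leave the iteration itself unchanged, which is why the cost per iterate is where an accelerated method gains.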

In  the  area  of  parallelizing  circuit  and  device  simulation,  the  key  problem  is  finding  efficient  techniques 
for  solving  large  sparse  linear  systems.  The  direct  or  Gaussian-elimination  based  solution  of  circuit  simulation 
matrices  is  particularly  difficult  to  parallelize,  mostly  because  the  data  is  structured  irregularly,  and  methods 
which  attempt  to  regularize  the  structure,  like  nested  dissection,  lead  to  matrices  that  require  much  more 
computation  to  solve.  We  are  investigating  the  interaction  between  sparse  matrix  data  structures,  computer 
memory  structure,  and  multiprocessor  communication  (with  Prof.  W.  Dally).  One  interesting  recent  result 
from simulations is that final performance is much more sensitive to communication throughput than to latency.

Another approach to solving large sparse linear systems is through iteration, which is usually more
structured but is not as robust as direct methods. In order to improve the convergence of relaxation methods
for circuit simulation, an algorithm is being investigated that is based on extracting bands from a given sparse
matrix, solving the bands directly, and relaxing on the rest of the matrix. This approach is efficient because
band matrices can be solved in order log n time on order n processors, and it is more reliable
than standard relaxation because “less” relaxation is being used. A banded relaxation scheme has been
developed that automatically selects the ordering of the matrix to best exploit the direct solution of the band,
and automatically selects the band size.† In order to increase the parallelism available from the variable
band algorithm, we are also investigating waveform-Newton, which allows for several timepoints of a transient
simulation to be computed in parallel.
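The band idea can be sketched as a splitting A = B + R, where B is the band and R is everything else, iterated as B x_new = b - R x_old. The sketch below is a simplified illustration under stated assumptions: the band is fixed to the tridiagonal part, the matrix ordering is taken as given, and the band solve is a sequential Thomas elimination rather than a parallel method. The automatic ordering and band-size selection described above are not attempted, and the function names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):            # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def band_relaxation(A, b, tol=1e-12, max_iter=200):
    """Solve the tridiagonal band directly and relax on the off-band entries."""
    n = len(A)
    lower = [0.0] + [A[i][i - 1] for i in range(1, n)]
    diag = [A[i][i] for i in range(n)]
    upper = [A[i][i + 1] for i in range(n - 1)] + [0.0]
    x = [0.0] * n
    for _ in range(max_iter):
        # Move the off-band part R x to the right-hand side using the old iterate.
        rhs = [b[i] - sum(A[i][j] * x[j] for j in range(n) if abs(i - j) > 1)
               for i in range(n)]
        x_new = solve_tridiagonal(lower, diag, upper, rhs)
        change = max(abs(u - v) for u, v in zip(x_new, x))
        x = x_new
        if change < tol:
            break
    return x
```

The iteration converges when the off-band part is small relative to the band, which is the sense in which less of the matrix is left to relaxation.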

*  K. Nabors, J. White, “A Fast Multipole Algorithm for Capacitance Extraction of Complex 3-D
Geometries,” Proceedings, Custom Integrated Circuits Conference, San Diego, CA, 1989.

†  A. Lumsdaine, D. Webber, J. White, A. Sangiovanni-Vincentelli, “A Band Relaxation Algorithm for Reliable
and Parallelizable Circuit Simulation,” Proceedings, International Conference on Computer-Aided Design,
Santa Clara, CA, October, 1988.

In the area of circuit simulation, the problem of simulating clocked analog circuits, like switching filters,
switching power supplies, and phase-locked loops, is being attacked. These circuits are computationally
expensive to simulate using conventional techniques because they are all clocked at a frequency whose period is
orders of magnitude smaller than the time interval of interest to the designer. To construct such a long time
solution, a program like SPICE or ASTAP must calculate the behavior of the circuit for many high frequency
clock cycles. Several very efficient algorithms for these types of problems have been developed, based on
computing the solution over only a few selected high-frequency cycles. In particular, techniques for computing
the transient behavior of switching power converters* and the distortion of switched-capacitor
filters† have been developed. The distortion analysis algorithm is well-suited to parallel computation as it
allows many high frequency cycles to be integrated simultaneously.

A second application for specialized circuit simulation algorithms is the simulation of analog signal
processing circuits used for early vision. These circuits are expensive to simulate with classical methods because
they usually contain large grids of components which must be simulated at an analog level (i.e., one cannot
perform simulations at a switch or gate level as is commonly done with very large digital circuits). Several
properties  of  the  analog  signal  processing  circuits  can  be  exploited  to  improve  the  efficiency  and  parallelizability 
of  simulation  algorithms.  As  most  of  these  circuits  are  arranged  in  large  regular  grids,  the  computation 
involved  is  like  the  computations  used  to  solve  certain  types  of  partial  differential  equations.  We  expect  this 
research  direction  will  lead  us  to  generalizations  of  certain  types  of  fast  partial  differential  equation  methods, 
and  we  are,  in  particular,  focusing  on  waveform  multigrid  methods. 

In  the  area  of  device  simulation,  we  are  working  on  simulating  short-channel  MOS  devices.  The 
difficulty  is  that  the  model  used  in  conventional  device  simulation  programs  is  based  on  the  drift-diffusion 
model  of  electron  transport,  and  this  model  does  not  accurately  predict  the  field  distribution  near  the  drain  in 
small  geometry  devices.  This  is  of  particular  importance  for  predicting  oxide  breakdown  due  to  penetration  by 
“hot” electrons. There are two approaches for more accurately computing the electric fields in MOS devices:
one is based on adding an energy equation to the drift-diffusion model, and the second is based on particle or
Monte-Carlo simulations.

In  the  first  approach,  an  energy  balance  equation  is  solved  along  with  the  drift-diffusion  equations  so 
that  the  electron  temperatures  are  computed  accurately.  This  combined  system  is  numerically  less  tame  than 
the  standard  approach,  and  must  be  solved  carefully.  Implementations  of  the  energy  balance  equation  in 
simulators  either  circumvent  this  problem  by  ignoring  difficult  terms,  or  they  occasionally  produce  oscillatory 
results. Our research in this area aims to develop a simulation program based on the drift-diffusion plus energy
equations which is both efficient and robust. A stable numerical method for 1-D simulation has been
implemented, and present work is to carry this forward to a 2-D simulator.

Work  on  the  second  approach,  solving  the  Boltzmann  equation  with  Monte-Carlo  algorithms,  is  just 
beginning.  We  are  focusing  on  issues  of  the  numerical  interaction  between  the  computation  of  the  self- 
consistent  electric  fields  and  the  simulation  timesteps.  In  addition  we  are  investigating  approaches  which 
parallelize  efficiently. 

Finally, we are continuing to investigate accelerating mixed circuit/device transient simulation with
waveform  relaxation  (WR),  that  is,  applying  WR  to  the  sparsely-connected  system  of  algebraic  and  ordinary 
differential  equations  in  time  generated  by  standard  spatial  discretization  of  the  drift-diffusion  equations  that 
describe  MOS  devices.  Recent  results  include  proving  an  extension  to  a  result  indicating  that  the  WR  algorithm 
will  converge  in  a  uniform  manner  independent  of  the  time  interval,  and  that  a  multirate  integration  will  be 
stable independent of timestep.‡ In addition, a preliminary 2-D device simulation program based on WR has
been  written,  and  experiments  on  accelerating  WR  convergence  using  SOR  and  Conjugate-Gradient  methods 
are  in  progress. 
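Waveform relaxation iterates on whole time waveforms rather than on single timepoints: each equation is integrated over the full interval while the other unknowns are held at their waveforms from the previous sweep. The sketch below is a toy illustration only. The two-variable linear system x1' = -2 x1 + x2, x2' = x1 - 2 x2, the Gauss-Jacobi ordering, and the backward Euler discretization are all assumptions chosen for brevity; nothing here is taken from the device simulator described above.

```python
def waveform_relaxation(x0, h, steps, sweeps):
    """Gauss-Jacobi WR for x1' = -2*x1 + x2, x2' = x1 - 2*x2 (backward Euler)."""
    w1 = [x0[0]] * (steps + 1)       # initial guess: constant waveforms
    w2 = [x0[1]] * (steps + 1)
    for _ in range(sweeps):
        n1, n2 = [x0[0]], [x0[1]]
        for k in range(1, steps + 1):
            # Each equation is integrated alone, reading the other variable
            # from the previous sweep's waveform at the same timepoint.
            n1.append((n1[-1] + h * w2[k]) / (1.0 + 2.0 * h))
            n2.append((n2[-1] + h * w1[k]) / (1.0 + 2.0 * h))
        w1, w2 = n1, n2
    return w1, w2


def coupled_backward_euler(x0, h, steps):
    """Reference solution: both equations solved together at every timestep."""
    a, b = 1.0 + 2.0 * h, -h         # (I - h*A) = [[a, b], [b, a]]
    det = a * a - b * b
    x1, x2 = [x0[0]], [x0[1]]
    for _ in range(steps):
        p, q = x1[-1], x2[-1]
        x1.append((a * p - b * q) / det)
        x2.append((a * q - b * p) / det)
    return x1, x2
```

With enough sweeps, the relaxed waveforms agree with the coupled solution. Because the two inner updates are independent of each other within a sweep, they could be computed on separate processors.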


*  K.  Kundert,  J.  White,  A.  Sangiovanni-Vincentelli,  “An  Envelope-Following  Method  for  the  Efficient 
Transient  Simulation  of  Switching  Power  Converters,”  Proceedings,  International  Conference  on 
Computer-Aided  Design,  Santa  Clara,  CA,  October,  1988. 

†  K. Kundert, J. White, A. Sangiovanni-Vincentelli, “A Mixed Frequency-Time Approach for Distortion
Analysis of Switched Capacitor Filters,” IEEE Journal of Solid State Circuits, April, 1989.

‡  M. Crow, J. White, M. Ilic, “Stability and Convergence Aspects of Waveform Relaxation Applied to Power
System Simulation,” Proceedings, International Symposium on Circuits and Systems, Portland, OR, 1989.
Research on techniques for logic synthesis, testing and design-for-testability focuses on the optimization
of  combinational  and  sequential  circuits  specified  at  the  register-transfer  or  logic  levels  with  area,  performance 
and  testability  of  the  synthesized  circuit  as  design  parameters.  The  research  problems  being  addressed  are: 

•  Area  and  performance  optimization  of  general  sequential  circuits  composed  of  interacting  finite 
state  machines 

•  Test  generation  for  general  sequential  circuits  without  the  restrictions  of  Scan  Design  rules 

•  The exploration of relationships between combinational/sequential logic synthesis and testability
with a view to the development of techniques for the automatic synthesis of fully and easily testable circuits.

Interacting  finite  state  machines  (FSMs)  are  common  in  industrial  chip  designs.  While  optimization 
techniques  for  single  FSMs  are  relatively  well  developed,  the  problem  of  optimization  across  latch  boundaries 
has  received  much  less  attention.  Techniques  to  optimize  pipelined  combinational  logic  so  as  to  improve 
area/throughput  have  been  proposed.  However,  logic  cannot  be  straightforwardly  migrated  across  latch 
boundaries  when  the  basic  blocks  are  sequential  rather  than  combinational  circuits.  We  have  addressed  the 
problem  of  logic  migration  across  state  machine  boundaries  so  as  to  make  particular  machines  less  complex  at 
the  possible  expense  of  making  others  more  complex.*  This  can  be  useful  from  both  an  area  and 
performance point of view. Optimization algorithms, based on automata-theoretic decomposition techniques,
that  incrementally  modify  state  machine  structures  across  latch  boundaries,  so  as  to  improve  area  or  throughput 
of  a  sequential  circuit,  have  been  developed.  We  are  now  looking  toward  developing  more  global  techniques  for 
logic  migration  in  sequential  circuits. 

Interacting  sequential  circuits  can  be  optimized  by  specifying  and  exploiting  the  don’t  care  conditions 
that occur at the boundaries of the different machines. While the specification of don’t care conditions for
interconnected  combinational  circuits  is  a  well-understood  problem,  the  corresponding  sequential  circuit 
problem  has  received  very  little  attention.  We  have  defined  a  complete  set  of  don’t  cares  associated  with 
arbitrary,  interconnected  sequential  machines.  These  sequential  don’t  cares  represent  both  single  vectors  and 
sequences  of  vectors  that  never  occur  at  latch  boundaries.  Exploiting  these  don’t  cares  can  result  in  significant 
reductions  in  the  number  of  states  and  complexities  of  the  individual  FSMs  in  a  distributed  specification.  We 
have developed algorithms for the systematic exploitation of these don’t cares† and are currently improving
the  performance  of  these  algorithms. 
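A sequential don’t care in the sense used above is an output vector, or a sequence of output vectors, that a driving machine can never present at the latch boundary, so the driven machine is free to behave arbitrarily on it. The sketch below enumerates such sequences for a toy driving machine. The Moore-machine representation, the mod-3 counter, its state codes, and all function names are hypothetical and chosen only for illustration; the formal definition is in the cited work.

```python
from itertools import product


def reachable_states(transitions, start):
    """States of the driving machine reachable from its start state."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in transitions[state].values():
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def occurring_sequences(transitions, outputs, start, length):
    """Every output sequence of the given length the machine can emit."""
    found = set()
    for state in reachable_states(transitions, start):
        frontier = {(state, (outputs[state],))}
        for _ in range(length - 1):
            frontier = {(nxt, seq + (outputs[nxt],))
                        for cur, seq in frontier
                        for nxt in transitions[cur].values()}
        found.update(seq for _, seq in frontier)
    return found


def sequence_dont_cares(transitions, outputs, start, vectors, length):
    """Sequences over `vectors` that never appear at the boundary."""
    every = set(product(vectors, repeat=length))
    return every - occurring_sequences(transitions, outputs, start, length)


# Hypothetical driving machine: a mod-3 counter that either holds or steps.
COUNTER = {
    "s0": {"hold": "s0", "step": "s1"},
    "s1": {"hold": "s1", "step": "s2"},
    "s2": {"hold": "s2", "step": "s0"},
}
CODES = {"s0": "00", "s1": "01", "s2": "10"}
```

For this counter, the vector "11" never occurs at all, and a pair such as ("00", "10") never occurs in sequence even though both vectors occur individually. The second kind is what distinguishes sequential don’t cares from combinational ones.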

Optimization  of  single  or  lumped  FSMs  has  been  the  subject  of  a  great  deal  of  research.  Optimal  state 
assignment  and  FSM  decomposition  are  critical  to  the  synthesis  of  area-efficient  logic  circuits. 

The problem of FSM decomposition entails decomposing a machine into interacting submachines so as
to improve area or performance of the circuit. We have developed new decomposition techniques based on
factorization of sequential machines.‡,§ This form of optimization involves identifying subroutines or factors
in the original machine and extracting these factors to produce factored and factoring machines. Factorization
can result in submachines which are smaller and faster than the original machine. Experimental results indicate
that factorization compares favorably to other techniques for FSM decomposition. We are also currently
exploring the relationships between factorization and the optimal state assignment problem.


*  S. Devadas, “Approaches to Multi-Level Sequential Logic Synthesis,” Proceedings of 26th Design Automation
Conference, Las Vegas, NV, June, 1989.

†  S. Devadas, “Approaches to Multi-Level Sequential Logic Synthesis,” Proceedings of 26th Design Automation
Conference, Las Vegas, NV, June, 1989.

‡  S. Devadas and A. R. Newton, “Decomposition and Factorization of Sequential Finite State Machines,”
Proceedings, International Conference on Computer-Aided Design, Santa Clara, CA, November, 1988.

§  S. Devadas, “General Decomposition of Sequential Machines: Relationships to State Assignment,”
Proceedings, 26th Design Automation Conference, Las Vegas, NV, June, 1989.

The problem of optimal state assignment entails finding an optimal binary encoding of the states in a
FSM, so that the encoded and minimized FSM has minimum area. All previous automatic approaches to state
encoding and assignment have involved the use of heuristic techniques. Other than the straightforward,
exhaustive search procedure, no exact solution methods have been proposed. A straightforward, exhaustive
search procedure requires O(N!) exact Boolean minimizations, where N is the number of symbolic states.
We have discovered a new minimization procedure* for multiple-valued input and multiple-valued output
functions that represents an exact state assignment algorithm. The present state and next state spaces of the
State Transition Graph of a FSM are treated as multiple-valued variables, taking on as many values as there
are states in the machine. The minimization procedure involves constrained prime implicant generation and
covering and operates on multiple-valued input, multiple-valued output functions. If a minimum set of prime
implicants is selected, a minimum solution to the state assignment problem is obtained. While our covering
problem is more complex than the classic unate covering problem of two-level Boolean minimization, a single
logic minimization step replaces O(N!) minimizations. We are currently evaluating the performance of this
exact algorithm and developing computationally-efficient heuristic state assignment strategies based on the
exact algorithm.
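The factorial growth of the exhaustive procedure is easy to see by enumerating the candidate encodings directly. The sketch below lists every assignment of distinct binary codes to a set of symbolic states and counts them. It is an illustration of the search space only: the Boolean minimization that an exhaustive search would run on each candidate is not shown, symmetries among encodings are not removed, and the function names are assumptions.

```python
from itertools import permutations


def minimum_code_bits(num_states):
    """Fewest bits that give every state a distinct binary code."""
    return max(1, (num_states - 1).bit_length())


def enumerate_state_assignments(states, num_bits):
    """Yield each mapping of states to distinct num_bits-wide binary codes."""
    codes = [format(value, "0{}b".format(num_bits)) for value in range(2 ** num_bits)]
    for chosen in permutations(codes, len(states)):
        yield dict(zip(states, chosen))


def count_state_assignments(num_states, num_bits):
    """Number of mappings produced by enumerate_state_assignments."""
    count = 1
    for k in range(num_states):
        count *= 2 ** num_bits - k
    return count
```

When the number of states is a power of two and the minimum code width is used, the count is exactly N!: 24 candidates for 4 states and 40320 for 8 states.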

The problem of four-level Boolean minimization, that is, the problem of finding a cascaded pair of two-level
logic functions that implement another logic function such that the sum of the product terms in the two
cascaded functions or truth-tables is minimum, can also be mapped onto an encoding problem similar to state
assignment. We have extended the exact state encoding algorithm to the four-level Boolean minimization case.

After chip fabrication, a chip has to be tested for correct functionality. Logic testing is a very difficult
problem and has traditionally been a post-design step; however, the impact of the design or synthesis process on
the testability of the circuit is very profound.

Our  research  in  the  testing  area  involves  test  pattern  generation  for  sequential  circuits  as  well  as  the 
development  of  synthesis-for-testability  approaches  for  combinational  and  sequential  circuits.  Highly  sequential 
circuits,  like  datapaths,  are  not  amenable  to  standard  test  pattern  generation  techniques.  We  are  attempting  to 
develop  algorithms  that  are  efficient  in  generating  tests  for  datapath-like  circuits,  by  exploiting  knowledge  of 
both  the  sequential  behavior  and  the  logic  structure  of  the  logic  circuit. 


Recently, there has been an explosion of interest in incorporating testability measures in logic synthesis
techniques. Our research follows the paradigm that redundancy in a circuit, which renders a circuit untestable,
is the result of a sub-optimal logic synthesis step. Thus, optimal logic synthesis can, in principle, ensure
fully testable combinational or sequential logic designs.

The  relationships  between  redundant  logic  and  don’t  care  conditions  in  combinational  circuits  are  well 
known.  Redundancies  in  a  combinational  circuit  can  be  explicitly  identified  using  test  generation  algorithms  or 
implicitly  eliminated  by  specifying  don’t  cares  for  each  gate  in  the  combinational  network  and  minimizing  the 
gates,  subject  to  the  don’t  care  conditions.  We  have  explored  the  relationships  between  redundant  logic  and 
don’t care conditions in arbitrary, interacting sequential circuits.†,‡ Stuck-at faults in a sequential circuit
may be testable in the combinational sense, but may be redundant because they do not alter the terminal
behavior of a non-scan sequential machine. These sequential redundancies result in a faulty State Transition
Graph (STG) that is equivalent to the STG of the true machine. We have classified all possible kinds of
redundant faults in sequential circuits, composed of single or interacting finite state machines. For each of the
different classes of redundancies, we define don’t care sets which, if optimally exploited, will result in the implicit
elimination of any such redundancies in a given circuit. We have shown that the exploitation of sequential don’t
cares that correspond to sequences of vectors that never appear in cascaded or interacting sequential circuits is
critically necessary in the synthesis of irredundant circuits. Using a complete don’t care set in an optimal
sequential synthesis procedure of state minimization, state assignment and combinational logic optimization
results in fully testable, lumped or interacting finite state machines. Preliminary experimental results indicate
that irredundant sequential circuits can be synthesized with no area overhead and within reasonable CPU times
by exploiting these don’t cares.


*  S. Devadas and A. R. Newton, “Exact Algorithms for Output Encoding, State Assignment and Four-Level
Boolean Minimization,” Electronics Research Laboratory Memorandum M89/8, University of
California, Berkeley, February, 1989.

†  S. Devadas et al., “Irredundant Sequential Machines Via Optimal Logic Synthesis,” Electronics Research
Laboratory Memorandum M88/52, University of California, Berkeley, August, 1988.

‡  S. Devadas et al., “Redundancies and Don’t Cares in Sequential Logic Synthesis,” in preparation.
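The statement that a redundant fault yields a faulty State Transition Graph equivalent to that of the true machine can be checked mechanically on small examples by walking both machines in lockstep from their start states and comparing outputs. The sketch below does this for Mealy machines given as nested dictionaries. The representation, the three example machines, and the function name are hypothetical; this is an illustration of the equivalence notion, not the classification procedure described above.

```python
from collections import deque


def machines_equivalent(machine_a, start_a, machine_b, start_b, inputs):
    """True if both machines emit the same outputs on every input sequence.

    Each machine maps state -> input -> (next_state, output).
    """
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        state_a, state_b = queue.popleft()
        for symbol in inputs:
            next_a, out_a = machine_a[state_a][symbol]
            next_b, out_b = machine_b[state_b][symbol]
            if out_a != out_b:
                return False         # some input sequence exposes a difference
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b))
    return True


# Hypothetical true machine; state "C" cannot be reached from start state "A".
TRUE_MACHINE = {
    "A": {0: ("A", 0), 1: ("B", 0)},
    "B": {0: ("A", 1), 1: ("B", 1)},
    "C": {0: ("A", 0), 1: ("A", 0)},
}
# A fault that only changes behavior in the unreachable state: redundant.
REDUNDANT_FAULT = dict(TRUE_MACHINE, C={0: ("B", 1), 1: ("C", 1)})
# A fault that changes an output in a reachable state: detectable.
DETECTABLE_FAULT = dict(TRUE_MACHINE, B={0: ("A", 0), 1: ("B", 1)})
```

Here `machines_equivalent(TRUE_MACHINE, "A", REDUNDANT_FAULT, "A", [0, 1])` holds, while the comparison against `DETECTABLE_FAULT` fails on the input sequence 1, 0.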

Procedures  that  guarantee  easy  testability  of  sequential  machines  via  constraints  on  the  optimization 
steps  are  also  a  subject  of  research.  These  procedures  address  both  the  testability  of  circuits  under  the  stuck-at 
fault  and  the  crosspoint  fault  model.  These  procedures  may  result  in  circuits  that  are  larger  than  area-minimal 
circuits,  but  which  are  more  easily  testable. 

The  different  pieces  of  the  research  described  above  are  all  focused  on  an  algorithmic  approach  for  the 
optimal  synthesis  of  custom  integrated  circuit  chips  with  area,  performance  and  testability  as  design  parameters. 
The  various  techniques  can  be  incorporated  into  an  ASIC  synthesis  system. 
Invited Presentation

Ten  years  ago  at  this  conference,  Clark  Thompson  introduced  a  simple, 
graph-theoretic  model  for  VLSI  circuitry  [22].  In  Thompson's  model,  a  circuit 
is  a  graph  whose  vertices  correspond  to  active  circuit  elements  and  whose  edges 
correspond  to  wires.  A  VLSI  layout  is  a  mapping  of  the  graph  to  a  two-dimensional 
grid,  such  that  each  vertex  is  mapped  to  a  square  region  of  the  grid  and  each  edge 
is  mapped  to  a  path  in  the  grid.  Unlike  the  classical  notions  of  a  graph  embedding 
from  mathematics,  Thompson's  model  allows  edges  of  a  graph  to  cross  over  one 
another,  like  wires  on  an  integrated  circuit. 

The interesting cost measure in VLSI is area. In Thompson’s model, area can
be measured as the number of grid points occupied by edges or vertices of the
graph. Quickly, the minimum-area layouts for familiar graphs were catalogued.
As shown in Figure 1, a mesh (two-dimensional array) with n vertices (√n by √n)
has Θ(n) area.¹ The normal way of drawing a complete binary tree (Figure 2a) has
Θ(n lg n) area, but the “H-tree” layout (Figure 2b) is much better: it has Θ(n) area.
A hypercube, which is a popular interconnection network for parallel computers,
requires considerably more area: Θ(n²).
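The gap between the two tree layouts can be seen from a simple count. The sketch below uses a simplified model that is an assumption of this illustration, not a construction from the text: an H-tree on 4^k leaves is placed in a square whose side obeys side(k) = 2 side(k-1) + 1, with one extra row and column for the connecting "H", while the conventional drawing uses one column per leaf and one row per tree level. Function names are hypothetical.

```python
def tree_vertices(levels):
    """Vertices in a complete binary tree with 4**levels leaves."""
    return 2 * 4 ** levels - 1


def h_tree_side(levels):
    """Side of the square holding the H-tree under the simplified recurrence."""
    side = 1
    for _ in range(levels):
        side = 2 * side + 1          # four sub-layouts plus one middle row/column
    return side


def h_tree_area(levels):
    return h_tree_side(levels) ** 2


def conventional_area(levels):
    """One column per leaf, one row per level of the tree."""
    leaves = 4 ** levels
    rows = 2 * levels + 1
    return leaves * rows
```

Under this model the H-tree area stays below twice the number of vertices for every size, which is linear growth, whereas the conventional drawing picks up an extra factor proportional to the tree depth.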

What causes a hypercube to occupy so much area? Although the size of a
vertex grows slowly with the number of vertices in a hypercube, most of the area of a
hypercube layout is devoted to wires. Figure 3 shows how the problem of wiring
a hypercube grows with the size of the hypercube. Wires are expensive, and wire
area represents the capital cost of communication on a VLSI chip. By measuring
communication costs in terms of the geometric concept of area, Thompson's model
enabled a mathematical theory of communication in VLSI systems to develop.

From its origin, VLSI theory has expanded in many fruitful and interesting
directions. Rather than attempting to describe the breadth of research in VLSI
theory, however, I would like to revisit the accomplishments along one narrow
path, layout theory, which I believe will have a fundamental impact on the
architecture of large parallel supercomputers.

¹The notation Θ(f(n)) means a function that grows at the same rate as f(n) to within a
constant factor as n becomes large. The notation O(f(n)) means a function that grows no more
quickly, and Ω(f(n)) means a function that grows no more slowly. Formal definitions for these
terms can be found in any textbook on analysis of computer algorithms.


Figure 3: Illustrations (not layouts) of hypercubes on 4, 8, 16, and 32 vertices. Any
layout of an n-vertex hypercube requires Ω(n²) area.

In his early work, Thompson discovered an important lower bound. The area of
an n-vertex graph is related to its bisection width: the minimum number of edges
that must be removed to partition the graph into two subgraphs of n/2 vertices
(to within 1, if the number of vertices is odd). For example, an n-vertex mesh has
a bisection width of √n. A complete binary tree has a bisection width of 1. A
hypercube has a bisection width of n/2. Thompson proved that any layout of a
graph with bisection width w requires Ω(w²) area.
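The three bisection widths quoted above can be confirmed on small instances by exhaustive search over balanced partitions. The sketch below is a brute-force check that is practical only for graphs of about 16 vertices; the graph constructors and function names are assumptions made for this illustration.

```python
from itertools import combinations


def bisection_width(num_vertices, edges):
    """Fewest edges cut by any split into halves (sizes differ by at most 1)."""
    best = len(edges)
    for side in combinations(range(num_vertices), num_vertices // 2):
        chosen = set(side)
        cut = sum(1 for u, v in edges if (u in chosen) != (v in chosen))
        best = min(best, cut)
    return best


def mesh_edges(k):
    """Edges of a k x k mesh with vertices numbered row by row."""
    edges = []
    for r in range(k):
        for c in range(k):
            v = r * k + c
            if c + 1 < k:
                edges.append((v, v + 1))
            if r + 1 < k:
                edges.append((v, v + k))
    return edges


def hypercube_edges(dim):
    """Edges of a hypercube on 2**dim vertices (labels differ in one bit)."""
    return [(u, u ^ (1 << bit))
            for u in range(2 ** dim)
            for bit in range(dim)
            if u < (u ^ (1 << bit))]


def complete_binary_tree_edges(num_vertices):
    """Edges of a complete binary tree stored in heap order."""
    return [((child - 1) // 2, child) for child in range(1, num_vertices)]
```

For 16 vertices the search gives a width of 4 for the mesh (√n) and 8 for the hypercube (n/2), and it gives 1 for a 15-vertex complete binary tree. Squaring these widths, as in Thompson's bound, shows why the hypercube needs area quadratic in n while the bound says little about the tree.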

It turns out that a small bisection width does not lead immediately to a
small-area layout. After all, if we take two n/2-vertex subgraphs, each with Θ(n²)
area, and connect them by a single edge, the resulting graph has a bisection width of
1 but still requires Θ(n²) area. Leslie Valiant and I were able to show in independent
work [24, 14, 15], however, that if there is a good recursive decomposition of a
graph, one where we can keep subdividing the subgraphs without cutting many
edges, then the graph has a small layout. For example, not only complete binary
trees, but any binary tree, no matter how badly balanced, can be laid out in O(n)
area by a divide-and-conquer method. Valiant and I were also able to show that
this method lays out any n-vertex planar graph in O(n lg² n) area. Later, Leighton
was able to show that a variant of our method was optimal on any graph to within
an O(lg² n) factor in area [9].

Leighton also introduced an interesting graph which he called the tree-of-meshes
graph, shown in Figure 4. He was able to prove that this graph requires Ω(n lg n)
area, thereby refuting a conjecture of mine that all planar graphs could be laid out
in O(n) area. It remains an open question in VLSI theory as to whether there exists
a planar graph that requires Ω(n lg² n) area, or if all planar graphs can be laid out
in O(n lg n) area.


Figure 4: The tree-of-meshes graph.

Numerous  other  results  in  layout  theory  have  been  obtained — too  many  to 
mention  them  all.  Paterson,  Ruzzo,  and  Snyder  [18]  and  Bhatt  and  Leiserson 
[3]  studied  how  to  keep  wires  short  while  preserving  small  area.  Valiant  [24],  Ruzzo 
and  Snyder  [21],  and  Dolev,  Leighton,  and  Trickey  [4]  studied  VLSI  layouts  in 
which  wires  are  not  allowed  to  cross.  Three-dimensional  integration  was  studied 
by  Rosenberg  [20],  Leighton  and  Rosenberg  [13],  and  Greenberg  and  Leiserson  [5]. 
Fault tolerance in wafer-scale circuits was studied by Rosenberg [19], Leighton and
Leiserson  [11,  10],  and  Greene  and  El  Gamal  [7].  The  packaging  of  graphs  into 
chips  was  studied  by  Leiserson  [15]  and  Bhatt  and  Leiserson  [2]. 

In  fact,  packaging  constraints  are  analogous  to  the  constraints  in  Thompson’s 
model.  At  any  level  of  packaging — chips,  boards,  backplanes,  racks,  or  cabinets — 
manufacturing  technology  constrains  the  number  of  external  connections  from  a 
package  to  be  much  smaller  than  the  number  of  components  within  the  package.  In 
Thompson's model, a square region with side s can support 4s external connections,
but it can contain s² vertices, which is considerably larger than 4s as s becomes
large.


Figure  5:  Packaging  a  complete  binary  tree. 

As  an  example  of  a  result  [15]  in  packaging,  Figure  5  shows  a  novel  way  to 
package  a  complete  binary  tree  using  4-pin  packages  of  a  single  type.  Each  chip 
contains  one  internal  node  of  the  tree,  with  three  external  connections,  and  the 
remainder  of  the  chip  is  packed  as  full  as  possible  with  a  complete  binary  tree,  with 
one  external  connection.  To  assemble  a  tree  with  twice  as  many  leaves,  we  use  two 
chips.  We  wire  up  one  of  the  unconnected  internal  nodes  on  one  of  the  chips  as  the 
parent  of  the  two  complete  binary  trees.  We  are  left  with  a  complete  binary  tree 
with  twice  as  many  leaves,  plus  one  unconnected  internal  node.  Thus,  considering 
the  two  chips  as  a  single  unit,  the  structure  is  the  same  as  the  one  with  which  we 
began.  By  repeating  the  process,  we  can  recursively  assemble  a  complete  binary 
tree  of  arbitrarily  large  size. 
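The recursive assembly described above preserves a simple invariant: every assembled unit exposes exactly four external connections, one for the root of its complete binary tree and three for its single unconnected internal node. The sketch below only does the pin bookkeeping. The `Unit` class, its field names, and the choice of 4 leaves per chip are illustrative assumptions, not details taken from [15].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    chips: int           # number of identical 4-pin packages in the unit
    leaves: int          # leaves of the complete binary tree in the unit
    root_pins: int = 1   # external connection of the tree's root
    spare_pins: int = 3  # external connections of the unconnected internal node

    @property
    def external_pins(self):
        return self.root_pins + self.spare_pins

    @property
    def tree_nodes(self):
        return 2 * self.leaves - 1

def combine(a, b):
    """Join two equal units into one unit with twice as many leaves."""
    if a.leaves != b.leaves:
        raise ValueError("units must hold trees of equal size")
    # a's spare node becomes the new root: two of its three pins are wired
    # to the roots of the two trees, and the third becomes the new root pin.
    new_root_pins = a.spare_pins - (a.root_pins + b.root_pins)
    # b's spare node is left over as the unit's unconnected internal node.
    return Unit(chips=a.chips + b.chips, leaves=a.leaves + b.leaves,
                root_pins=new_root_pins, spare_pins=b.spare_pins)

if __name__ == "__main__":
    unit = Unit(chips=1, leaves=4)  # hypothetical chip holding a 4-leaf tree
    for _ in range(3):
        unit = combine(unit, unit)
        print(unit.chips, "chips,", unit.leaves, "leaves,",
              unit.external_pins, "external pins")
```

Each doubling uses one spare node as the new parent, so the node count goes from 2(2L-1)+1 to 2(2L)-1, which is again a complete binary tree.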

The  work  in  layout  theory  culminated  with  the  development  by  Bhatt  and 
Leighton  [1]  of  a  general  framework  for  VLSI  layout.  They  proposed  a  layout 
method  with  which  they  were  able  to  obtain  optimal  or  near-optimal  layouts  for 
many graph-embedding problems. Their method has three steps. First, recursively
bisect the graph, forming a decomposition tree of the graph. Second, embed the
graph in the tree-of-meshes graph (Figure 4), typically with the vertices of the
graph at the leaves of the tree-of-meshes graph. The meshes in the tree-of-meshes
are used as crossbar switches for routing the edges of the graph. Third, obtain the
layout of the graph by looking at where the vertices and edges are mapped
when the tree-of-meshes graph is laid out according to known good layouts.

It  seemed  to  me  at  the  time  that  Bhatt  and  Leighton  had  solved  nearly  all  the 
interesting  open  problems  in  VLSI  layout  theory.  All  new  results  in  the  area  would 
be  little  more  than  refinements  of  existing  methods  with  no  more  real  insights  into 
the  nature  of  interconnectivity.  I  turned  my  attention  toward  parallel  computation, 
in which I had continued to be involved since my work with H. T. Kung on systolic
arrays  [8]. 

In fact, I was very much a proponent of special-purpose parallel computation
over general-purpose parallel computation, largely as a result of my work on VLSI
layout theory. After all, as Kung and I had shown, and as Kung has continued
to forcefully demonstrate, many computations can be performed efficiently on
simple linear-area structures such as one- and two-dimensional arrays. These
special-purpose networks have the nice property that they can be laid out so
that processors are dense and packaging costs are minimized. Moreover, for many
problems, they offer speedup which is linear in the number of processors in the
systolic array.

General-purpose  parallel  computers,  on  the  other  hand,  are  typically  based 
on  interconnection  networks,  such  as  hypercubes,  that  are  very  costly  for  the 
computation they provide. For example, any hypercube network embedded in
area A has at most O(√A) processors. The processors are therefore sparse in the
embedding, and connections dominate the cost. Similar results can be shown for a
three-dimensional VLSI model. Only O(V^(2/3)) processors of a hypercube network
can fit in a volume V.

Hypercube  networks  do  have  a  major  advantage  over  many  other  networks  for 
parallel computing, however. They are universal: a hypercube on n processors can
simulate any n-processor bounded-degree network in O(lg n) time. The simulation
overhead is polylogarithmic (a polynomial of lg n), an indication that the simulation
is a parallel simulation. A polynomial overhead in simulation is less interesting,
since O(n) overhead is easily obtained by a serial processor simulating each of the
n processors in turn.

The proof that an n-processor hypercube is universal goes roughly as follows.
Suppose we have a bounded-degree network R with n processors. Each processor
can communicate with all its neighbors in unit time. The hypercube can simulate
the network, therefore, by sending at most a constant number of messages from each
processor, where each message contains the information that travels on one of the
interconnections in R. It turns out that all messages can be routed on the hypercube
to their destinations in O(lg n) time [23].
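The text does not say which routing algorithm achieves the O(lg n) bound. One well-known candidate is randomized two-phase routing: send each message to a random intermediate node, then on to its destination, correcting one address bit per hop. The sketch below is offered under that assumption and may not match the scheme of [23]. It shows only that each path has at most 2 lg n hops, and it does not model the congestion analysis that the time bound actually rests on. The function names are hypothetical.

```python
import random

def bit_fix_path(src, dst, dims):
    """Path from src to dst that corrects differing address bits low to high."""
    path = [src]
    cur = src
    for b in range(dims):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b          # cross the hypercube edge in dimension b
            path.append(cur)
    return path

def two_phase_route(src, dst, dims, rng):
    """Route via a uniformly random intermediate node (two bit-fixing phases)."""
    mid = rng.randrange(1 << dims)
    first = bit_fix_path(src, mid, dims)
    second = bit_fix_path(mid, dst, dims)
    return first + second[1:]      # drop the duplicated intermediate node

if __name__ == "__main__":
    rng = random.Random(0)
    path = two_phase_route(0b0000, 0b1111, 4, rng)
    print([format(v, "04b") for v in path])
```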

The  notion  of  universality — the  ability  of  one  machine  to  efficiently  simulate 
every  machine  in  a  class — is  central  to  the  origins  of  computer  science.  A  universal 
machine  is  the  computer  theorist’s  idea  of  a  general-purpose,  as  opposed  to 
multipurpose,  machine.  A  universal  machine  can  do  the  function  of  any  machine, 
just by programming it, or, in the case of parallel-processing networks, just by
routing  messages.  A  universal  machine  may  not  be  the  best  machine  for  any 
given  job,  but  it  is  never  much  worse  than  the  best.  The  universality  theorem 
for  hypercubes  does  not  say  that  a  hypercube  is  the  fastest  network  to  build  on  n 
processors.  What  it  says  is  that  the  fastest  special-purpose  network  for  any  given 
problem  can’t  be  much  faster. 

From a VLSI theory standpoint, however, a special-purpose parallel machine has
a clear advantage over a universal parallel machine. Packaging its network can cost
much less. And although universality is a selling point, our economy favors machines
that are cheap and efficient, even if they are not universal. (How many combination
telephone-lawnmower-toothbrushes have been sold recently?) Special-purpose
networks for parallel computation are much cheaper than hypercube networks.
Thus, for a long time, I was skeptical about the cost-effectiveness of general-purpose
parallel computing.

I changed my mind, however, and became an advocate of general-purpose parallel
computing when I started to look more closely at the traditional assumptions
concerning universal networks. In fact, from a VLSI theory perspective, I discovered
that hypercubes are not really “universal” at all! An n-processor hypercube may
be able to efficiently simulate any n-processor bounded-degree network, but if we
normalize by area instead of by number of processors, we discover that an area-A
hypercube cannot simulate all area-A networks efficiently. For example, since an
area-A hypercube has only Θ(√A) processors, it can’t simulate an area-A mesh,
which has Θ(A) processors, in polylogarithmic time. A network that is universal
from a VLSI point of view should be a network that for a given area can efficiently
simulate any other network of comparable area.
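A short calculation makes the area-normalized comparison concrete. The sketch below assumes all hidden constants are 1, so an area-A hypercube holds √A processors and an area-A mesh holds A. These are illustrative figures, not measurements. The slowdown shown is the work-based floor: each hypercube processor must stand in for A/√A = √A mesh processors per step.

```python
import math

def processors_in_area(area):
    """Processor counts for a given layout area, constants assumed to be 1."""
    return {"hypercube": math.isqrt(area), "mesh": area}

def minimum_slowdown(area):
    """Mesh processors that each hypercube processor must simulate per step."""
    counts = processors_in_area(area)
    return counts["mesh"] / counts["hypercube"]

if __name__ == "__main__":
    for area in (10**4, 10**6, 10**8):
        print(f"A = {area:>9}: slowdown at least {minimum_slowdown(area):.0f}, "
              f"lg A = {math.log2(area):.1f}")
```

The floor grows as √A, which eventually exceeds any fixed power of lg A, so the simulation cannot be polylogarithmic.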

One  such  area-universal  network  is  a  fat-tree  [16, 6],  which  is  based  on  Leighton’s 
tree-of-meshes  graph.  As  shown  in  Figure  6,  processors  occupy  the  leaves  of  the  tree, 
and  the  meshes  are  replaced  with  switches.  Unlike  a  computer  scientist's  traditional 
notion  of  a  tree,  a  fat-tree  is  more  like  a  real  tree  in  that  it  gets  thicker  further 
from  the  leaves.  Local  messages  can  be  routed  within  subtrees,  like  phone  calls  in  a 
telephone  exchange,  thereby  requiring  no  bandwidth  higher  in  the  tree.  The  number 
of external connections from a subtree with m processors is proportional to √m,
which is the perimeter of a region of area m. The area of the network is O(n lg² n),
which  is  nearly  linear  in  the  number  n  of  processors.  Thus,  the  processors  are 
packed  densely  in  the  layout. 

Any  network  R  that  fits  in  a  square  of  area  n  can  be  efficiently  simulated  by 
an  area-universal  fat-tree  on  n  processors.  To  perform  the  simulation,  we  ignore 
the  wires  in  R  and  map  the  processors  of  R  to  the  processors  of  the  fat-tree  in  the 
natural  geometric  way,  as  shown  in  Figure  7.  As  in  the  hypercube  simulation,  each 
wire  of  R  is  replaced  by  a  message  in  the  fat-tree.  If  we  look  at  any  m-processor 
subtree  of  the  fat-tree,  it  simulates  at  most  a  region  of  area  m  in  the  layout  of  R. 
The number of wires that can leave this area-m region in R's layout is O(√m),
and the fat-tree channel connecting to the root of the subtree has Θ(√m) wires.
Thus, the load factor of the channel, the ratio of the number of messages to channel
bandwidth, is O(1). It turns out that there are routing algorithms [16, 6, 12] that
effectively guarantee that all messages are delivered in polylogarithmic time. (In
fact, the algorithms can deliver messages near optimally even if the load factor is
quite large.)
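The load factor defined above can be computed directly for a given message set. The sketch below makes several simplifying assumptions that do not come from the source: processors are leaves 0..n-1 of a binary fat-tree with n a power of two, the channel above an m-leaf subtree has exactly ⌈√m⌉ wires, and messages in both directions share that capacity. The function name is hypothetical.

```python
import math
from collections import Counter

def load_factor(n, messages):
    """Largest ratio of messages crossing a channel to that channel's capacity.

    messages is a list of (source_leaf, destination_leaf) pairs.
    """
    worst = 0.0
    m = 1                                  # leaves per subtree at this level
    while m < n:
        capacity = math.ceil(math.sqrt(m)) # assumed wires above an m-leaf subtree
        crossing = Counter()
        for src, dst in messages:
            if src // m != dst // m:       # message leaves src's subtree
                crossing[src // m] += 1    # and enters dst's subtree
                crossing[dst // m] += 1
        if crossing:
            worst = max(worst, max(crossing.values()) / capacity)
        m *= 2
    return worst

if __name__ == "__main__":
    print(load_factor(4, [(0, 2), (1, 3)]))  # both channels evenly loaded
    print(load_factor(4, [(0, 3), (1, 3)]))  # leaf 3's channel is overloaded
```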

Figure 6: An area-universal fat-tree.

Similar universality theorems can be proved for three-dimensional VLSI models
using volume-universal fat-trees. For a fat-tree to be universal for volume, however,
the channel capacities must be selected differently from those in an area-universal
network. Whereas the average growth rate of channels in an area-universal fat-tree
is √2, the average growth rate in a volume-universal fat-tree is ∛4.
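The two growth rates follow from how the boundary of a region scales with its size. The derivation below is a sketch. It assumes a binary fat-tree, so each level doubles the number m of processors in a subtree, and it takes the channel capacity c(m) to be proportional to the perimeter of an area-m region or to the surface of a volume-m region.

```latex
\begin{align*}
\text{area-universal:}   \quad & c(m) \propto m^{1/2}
  \;\Longrightarrow\; \frac{c(2m)}{c(m)} = 2^{1/2} = \sqrt{2}, \\
\text{volume-universal:} \quad & c(m) \propto m^{2/3}
  \;\Longrightarrow\; \frac{c(2m)}{c(m)} = 2^{2/3} = \sqrt[3]{4}.
\end{align*}
```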

Figure 7: Any area-n network R can be efficiently simulated by an n-processor
area-universal fat-tree.

Figure 8: A scalable fat-tree.

In practice, of course, no mathematical rule governs interconnect technology.
Most networks that have been proposed for parallel processing, such as meshes
and hypercubes, are inflexible when it comes to adapting their topologies
to the arbitrary bandwidths provided by packaging technology. The growth
in channel bandwidth of a fat-tree, however, is not constrained to follow a
prescribed mathematical formula. The channels of a fat-tree can be adapted to
effectively utilize whatever bandwidths the technology can provide and which make
engineering sense in terms of cost and performance. Figure 8 shows one variant
of a fat-tree composed of two kinds of small switches: a three-connection switch
and a four-connection switch. By choosing one of these two kinds of switches
at each level of the fat-tree, the bandwidths of channels can be adjusted. If the
three-connection switch is always selected, an ordinary complete binary tree results.
If the four-connection switch is always selected, a butterfly network, which is a
relative of a hypercube, results. By suitably mixing these two kinds of switches,
a fat-tree that falls between these two extremes can be constructed that closely
matches the bandwidths provided by the interconnect technology.
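The effect of the per-level switch choice on channel bandwidth can be tabulated with a few lines of Python. The sketch rests on an interpretation that the text implies but does not state: a three-connection switch joins two child wires to one parent wire, so the parent channel is as wide as each child channel, while a four-connection switch joins two child wires to two parent wires, so the parent channel is twice as wide. The function name is hypothetical.

```python
def channel_capacities(switch_choices):
    """Wires per channel at each level, starting from one wire per processor.

    switch_choices[i] is 3 or 4: the kind of switch used at level i,
    counted upward from the leaves.
    """
    capacities = [1]
    for kind in switch_choices:
        if kind == 3:
            capacities.append(capacities[-1])      # width unchanged
        elif kind == 4:
            capacities.append(2 * capacities[-1])  # width doubles
        else:
            raise ValueError("switch kind must be 3 or 4")
    return capacities

if __name__ == "__main__":
    print(channel_capacities([3, 3, 3, 3]))  # binary tree: constant width
    print(channel_capacities([4, 4, 4, 4]))  # butterfly-like: width doubles
    print(channel_capacities([3, 4, 3, 4]))  # mixed: doubles every other level
```

Under this reading, alternating the two switch kinds doubles the width every second level, an average growth of √2 per level.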

The notion of locality exploited by fat-trees is but one of three such notions
that arise in the engineering of a parallel computer. The most basic notion of
locality is exemplified by wire delay and measured in distance. Communication
is speed-of-light limited. If this notion of locality dominates, the nearest-neighbor
communication provided by a three-dimensional mesh is the best one can hope for.
For many systems, however, wire delay is dominated by the time it takes for logic
circuits to compute their functions. The second notion of locality is exemplified
by levels of logic circuits and measured in gate delays. Communication time is
essentially limited by the number of switches a message passes through. From this
point of view, structures with small diameters, such as hypercubes, seem ideal.
In a routing network, however, a heavy load of messages can cause congestion,
and the time it takes to resolve this congestion can dominate both wire and gate
delays. Congestion is especially likely to occur in networks that make efficient use of
packaging technology. The last notion of locality is exemplified by the congestion of
messages leaving a subsystem and measured by load factor. From this standpoint,
fat-trees offer provably good performance by a general-purpose network that can be
packaged efficiently. Recent work [17] has shown that efficient parallel algorithms
can be designed for this kind of network, as well.

Whatever  the  point  of  view,  however,  all  three  notions  of  locality  must  guide  the 
engineering  and  programming  of  very  large  machines.  There  are  problems  in  the 
sciences  that  cry  out  for  massive  amounts  of  computation,  most  of  which  exhibit 
locality  naturally:  problems  in  astronomy,  such  as  galaxy  simulation;  problems 
in  biology,  such  as  the  combinatorics  of  DNA  sequencing;  problems  in  economics, 
such  as  market  prediction;  problems  in  aerospace,  such  as  fluid-flow  simulation; 
problems  in  earth,  atmospheric,  and  ocean  sciences,  such  as  earthquake  and  weather 
prediction.  To  address  these  problems  effectively,  very  large  parallel  computers 
must  be  constructed.  Some  of  these  computers  may  even  be  “building  sized.”  To 
construct  and  program  such  large  machines,  however,  locality  must  be  exploited, 
and  computer  engineers  must  come  to  grips  with  the  lessons  of  VLSI  theory. 
ABSTRACT

Concurrent computing is fundamentally different from sequential computing.
Task size is orders of magnitude smaller, making synchronization and scheduling
major concerns; the critical resources are communication and memory; and
programs distribute tasks rather than looping. Conventional hardware and
operating system mechanisms are highly evolved for sequential computing and are
not appropriate for concurrent systems. This position paper examines the
mechanisms required by concurrent systems and the structure of a system
incorporating these mechanisms.

1  FUNDAMENTAL  PROBLEMS 

1.1  Primitive  Mechanisms 

A fundamental hardware problem is to identify a set of primitive mechanisms
that efficiently support a broad range of concurrent execution models.
Sequential machines have evolved stacks for memory allocation, paging for
memory management, and program counters for instruction sequencing. Concurrent
machines have very different demands in each of these areas; the sequential
mechanisms are no longer appropriate. However, no concurrent mechanisms have
yet evolved to take their places. Today’s concurrent computers either interpret
their execution model using sequential mechanisms or are hardwired for a single
execution model.

The message-driven processor (MDP) [4] [6] is designed to evaluate concurrent
execution mechanisms for communication, synchronization, and naming. A SEND
instruction and hardware message reception and buffering allow efficient
communication of short messages across a high-speed network [7].
Synchronization is supported by a dispatch mechanism that creates a new process
to handle a message in a single clock cycle. A general purpose translation
mechanism supports naming. These mechanisms provide the primitive support
required by many concurrent models of computation including dataflow [9],
actors [1], and communicating processes [11].


1.2  Resource  Management 

At the operating system level, a key problem is to develop resource management
techniques suitable for concurrent systems. In a concurrent system,
communication bandwidth and memory capacity are the limiting resources;
processor cycles are almost free. This situation is the opposite of the
sequential case, where processor cycles are considered the critical resource
and communication is not a consideration. To complicate the situation, the
resources are physically distributed. Objects and processes must be placed in a
manner that balances memory and processor use across the machine and reduces
communication. The JOSS operating system [14] [15] is designed to satisfy these
unconventional requirements.

Methods must also be developed to regulate concurrency. Many programs have too
much parallelism and thus generate more tasks than can be accommodated in the
available memory. To avoid the resulting deadlock, the system must regulate
programs, allowing them to generate sufficient concurrency to make use of all
available processors, but reverting to more sequential execution before
exhausting memory. Examples of regulation include controlled unrolling of
loops [2] and adaptive (FIFO vs LIFO) scheduling [10].
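One way to picture adaptive FIFO/LIFO scheduling is a task pool that takes the oldest task while the pool is small, which expands the task tree breadth-first and exposes parallelism, and takes the newest task once the pool reaches a limit, which proceeds depth-first and keeps the pool from growing much further. The sketch below is a single-threaded illustration under that assumption. It is not the algorithm of [10], and `run_tasks`, `expand`, and `max_pending` are hypothetical names.

```python
from collections import deque

def run_tasks(root, expand, max_pending):
    """Run a task tree to completion; return (tasks_run, peak_pool_size).

    expand(task) returns the list of child tasks that the task creates.
    """
    pending = deque([root])
    tasks_run, peak = 0, 1
    while pending:
        if len(pending) < max_pending:
            task = pending.popleft()  # FIFO: breadth-first, more concurrency
        else:
            task = pending.pop()      # LIFO: depth-first, bounded memory
        pending.extend(expand(task))
        tasks_run += 1
        peak = max(peak, len(pending))
    return tasks_run, peak

def children(depth_remaining):
    """Each task spawns two subtasks until the remaining depth reaches 0."""
    return [depth_remaining - 1] * 2 if depth_remaining > 0 else []

if __name__ == "__main__":
    print(run_tasks(10, children, max_pending=10**9))  # pure FIFO
    print(run_tasks(10, children, max_pending=32))     # adaptive
```

With pure FIFO the pool peaks at the full width of the task tree (1024 pending tasks here), while the adaptive policy runs the same 2047 tasks with a pool that stays within the limit plus the tree depth.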

To make efficient use of the communication resources, memory and tasks must be
allocated in a manner that exploits locality. Placing objects near each other
to improve locality is often at odds with the need to distribute objects for
load balancing. Also, there are some cases where communication bandwidth can be
increased by spreading out a computation to make more channels available. For
static computations, min-cut placement techniques similar to those used to
place electronic components [12] work well. Dynamic computations rely heavily
on heuristics (e.g., placing an object near the object that created it)
supplemented by reactive load balancing.

1.3  Overhead 

To make use of a computer with thousands of processors, a program must be
decomposed into many small tasks. Each task consists of only a few
instructions. In conventional systems, however, the overhead of scheduling,
synchronization, and communication is many hundreds of instructions per task.
This overhead restricts conventional multicomputers to operating at a very
coarse grain size: thousands of instructions per task. Concurrency is reduced
because there are fewer large tasks. Also, the resource management problems
become harder as resources are allocated in larger chunks.

Overhead can be reduced to just a few instructions per task. The JOSS operating
system, using the primitive mechanisms provided by the MDP, can create,
suspend, resume, or destroy a task in fewer than ten instructions [15]. This
efficient management of fine-grain tasks is achieved without sacrificing
protection. Each task executes in its own naming environment.
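The link between per-task overhead and usable grain size is simple arithmetic. The sketch below assumes overhead and useful work are both counted in instructions and that each task pays the overhead once. The figures 300 and 10 are illustrative stand-ins for "many hundreds" and "fewer than ten", not measured values.

```python
def task_efficiency(useful_instructions, overhead_instructions):
    """Fraction of executed instructions that do useful work."""
    return useful_instructions / (useful_instructions + overhead_instructions)

def min_grain_size(overhead_instructions, target_efficiency):
    """Smallest task size (in instructions) that reaches the target efficiency."""
    return overhead_instructions * target_efficiency / (1.0 - target_efficiency)

if __name__ == "__main__":
    for overhead in (300, 10):  # hypothetical conventional vs. fine-grain overhead
        print(f"overhead {overhead}: a 10-instruction task is "
              f"{task_efficiency(10, overhead):.1%} efficient; 90% efficiency "
              f"needs about {min_grain_size(overhead, 0.9):.0f} instructions per task")
```

With several hundred instructions of overhead, tasks must run to thousands of instructions to be efficient, which matches the coarse grain size described above.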

2  CONCURRENT  COMPUTER 
ORGANIZATION 

To make the most efficient use of projected VLSI technology, general purpose
concurrent computers will be constructed from a number of fine-grain processing
nodes [5] connected by a low-latency, wire-efficient interconnection
network [3].

2.1  Fine-Grain  Processing  Nodes 

The grain size of a machine refers to the physical size and the amount of
memory in one processing node. A coarse-grain processing node requires hundreds
of chips (several boards) and has ≈10⁷ bytes of memory, while a fine-grain node
fits on a single chip and has ≈10⁴ bytes of memory. Fine-grain nodes cost less
and have less memory than coarse-grain nodes; however, because so little
silicon area is required to build a fast processor, they need not have slower
processors than coarse-grain nodes.

VLSI technology makes it possible to build small, powerful processing elements.
A 1M-bit DRAM chip has an area of 256Mλ² (λ is half the minimum line
width [13]). In the same area we can build a single-chip processing node
containing:


    A 32-bit processor             16Mλ²
    A floating-point unit          32Mλ²
    A communication controller      8Mλ²
    512Kbits RAM                  128Mλ²
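Adding up the table shows that the node fits within the 256Mλ² area of the DRAM chip with room to spare. The sketch below just performs that sum. The dictionary keys are descriptive labels, and the remaining area is presumably available for wiring and pads, which is an assumption rather than a statement from the text.

```python
CHIP_AREA = 256  # Mλ², the area of a 1M-bit DRAM chip

components = {   # areas in Mλ², taken from the table above
    "32-bit processor": 16,
    "floating-point unit": 32,
    "communication controller": 8,
    "512Kbits RAM": 128,
}

used = sum(components.values())
spare = CHIP_AREA - used

if __name__ == "__main__":
    for name, area in components.items():
        print(f"{name:<26}{area:>4} Mλ²  ({area / CHIP_AREA:.1%} of chip)")
    print(f"{'total':<26}{used:>4} Mλ²; {spare} Mλ² unallocated")
```

The RAM figure is consistent with the DRAM density: half of a 1M-bit array occupies half of 256Mλ².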


Such a single-chip processing node would have the same processing power as a
board-sized node but significantly less memory per node. The memory capacity of
the entire machine is comparable to that of a coarse-grained machine. We refer
to a machine built from these nodes as a jellybean machine, as it is built with
commodity part (jellybean) technology [8].


A fine-grain processing node has two major advantages: density and memory
bandwidth. Several hundred single-chip nodes can be packaged on a single
printed circuit board, permitting us to exploit hundreds of times the
concurrency of machines with board-sized nodes. With on-chip memory we can read
an entire row of memory (128 or 256 bits) in a single cycle without incurring
the delay of several chip crossings. This high memory bandwidth allows the
memory to simultaneously buffer messages from a high bandwidth network and
provide the processor with instructions and data.

Fine-grain machines are area efficient. Area efficiency is given by
e_A = A_1 T_1 / (A_N T_N), where A_i is the area of i processors, T_i is
execution time on i processors, and N is the number of processors. Many
researchers have measured their machines' effectiveness in terms of node
efficiency, e_N = T_1 / (N T_N). Proponents of coarse-grain machines argue that
a machine constructed from several thousand single-chip nodes would be
inefficient because many of the processing nodes will be idle: N is large,
hence e_N is small. A user, however, is not concerned with N, but rather with
machine cost, A_N, and how long it takes to solve a problem, T_N. Fine-grain
machines have a very high e_A because they are able to exploit more concurrency
in a smaller area.
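The two efficiency measures can diverge sharply when memory, not processors, dominates the area. The sketch below evaluates both formulas for a hypothetical machine. It assumes, as an interpretation not spelled out in the text, that the one-processor reference machine must hold the whole problem in memory, so adding processors increases the total area only slightly. All numbers are invented for illustration.

```python
def node_efficiency(t1, tn, n):
    """e_N = T_1 / (N * T_N)."""
    return t1 / (n * tn)

def area_efficiency(a1, t1, an, tn):
    """e_A = (A_1 * T_1) / (A_N * T_N)."""
    return (a1 * t1) / (an * tn)

if __name__ == "__main__":
    memory_area, processor_area = 1000, 1  # hypothetical area units
    n = 100                                # processors in the fine-grain machine
    t1, tn = 1000, 25                      # hypothetical run times (speedup 40)
    a1 = memory_area + processor_area      # one processor plus all of the memory
    an = memory_area + n * processor_area  # same memory spread over n nodes
    print("node efficiency e_N:", node_efficiency(t1, tn, n))
    print("area efficiency e_A:", area_efficiency(a1, t1, an, tn))
```

Under these assumptions e_N is only 0.4, because most nodes are idle much of the time, yet e_A is far larger, because the extra processors cost little area.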

2.2  Wire-Efficient  Communication  Networks 

VLSI systems are wire limited. The cost of these systems is predominantly that
of connecting devices, and the performance is limited by the delay of these
interconnections. Thus, an interconnection network must make efficient use of
the wire. The topology of the network must map into the three physical
dimensions so that messages are not required to double back on themselves, and
in a way that allows messages to use all of the available bandwidth along their
path. Also, the topology and routing algorithm must be simple so the network
switches will be sufficiently fast to avoid leaving the wires idle while making
routing decisions. Our recent findings suggest that low-dimensional k-ary
n-cube interconnection networks [3] are capable of providing the performance
required by fine-grain concurrent architectures.
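A k-ary n-cube has k^n nodes, each addressed by n radix-k digits, with neighbors that differ by one (mod k) in a single digit. The sketch below counts hops under dimension-order routing. It assumes bidirectional channels with wraparound, which is an illustrative choice and not a detail given in the text, and the function names are hypothetical.

```python
def to_digits(node, k, n):
    """Radix-k address of a node, least significant dimension first."""
    return [(node // k ** i) % k for i in range(n)]

def hops(src, dst, k, n):
    """Hop count when each dimension is corrected in turn, taking the
    shorter way around the ring in that dimension."""
    total = 0
    for s, d in zip(to_digits(src, k, n), to_digits(dst, k, n)):
        delta = (d - s) % k
        total += min(delta, k - delta)
    return total

if __name__ == "__main__":
    print(hops(0, 5, k=4, n=2))   # 4-ary 2-cube (a 4 x 4 torus)
    print(hops(0, 15, k=2, n=4))  # 2-ary 4-cube (a 16-node hypercube)
```

Low-dimensional networks (small n, large k) have longer paths in hops, but each node needs few channels, so each channel can be made wide.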


3  TRANSITION  TO  MAINSTREAM 
CONCURRENT  COMPUTING 

Select areas of mainstream computing will switch to concurrent computers when
(1) concurrent software has matured to the point that it can support a large
evolving application and (2) the performance advantage of these machines is
sufficient to justify an investment in new software. Concurrent machines are
appropriate for applications that are (1) limited by CPU performance (e.g.,
scientific computing and signal processing) and (2) limited by memory system
bandwidth (e.g., transaction processing). It is also expected that the
availability of these machines will create new applications that were not
previously possible.
Abstract


CST is a programming language based on Smalltalk-80 that supports concurrency using locks, asynchronous messages,
and distributed objects. In this paper, we describe CST: the language and its implementation. Example programs
and initial programming experience with CST are described. An implementation of CST generates native code for the
J-machine, a fine-grained concurrent computer. Some novel compiler optimizations developed in conjunction with
that implementation are also described.


Introduction 


This paper describes CST, an object-oriented concurrent programming language based on Smalltalk-80 [7] and an
implementation of that language. CST adds three extensions to sequential Smalltalk. First, messages are
asynchronous. Several messages can be sent concurrently without waiting for a reply. Second, several methods may
access an object concurrently; locks are provided for concurrency control. Finally, CST allows the programmer
to describe distributed objects: objects with a single name but distributed state. They can be used to construct
abstractions for concurrency.

CST is being developed as part of the J-Machine project at MIT [4, 3]. The J-Machine is a fine-grain concurrent
computer. The primary building block in the J-Machine is the Message-Driven Processor (MDP). It efficiently
executes tasks with a grain size of 10 instructions and supports a global virtual address space. This machine requires
a programming system that allows programmers to concisely describe programs with method-level concurrency and
that facilitates the development of abstractions for concurrency.

Object-oriented  programming  meets  the  first  of  these  goals  by  introducing  a  discipline  into  message  passing.  Each 
expression  implies  a  message  send.  Each  message  invokes  a  new  process.  Each  receive  is  implicit.  The  global  address 
space  of  object  identifiers  eliminates  the  need  to  refer  to  node  numbers  and  process  IDs.  The  programmer  does  not 
have  to  insert  send  and  receive  statements  into  the  program,  keep  track  of  process  IDs,  and  perform  bookkeeping  to 
determine  which  objects  are  local  and  which  are  remote. 

For example, a CST program² that counts the number of leaves in a binary tree using double recursion is shown
in Figure 1. Nowhere in the program does the programmer explicitly specify a send or receive, and no node
numbers or process IDs are mentioned. Yet, as shown in Figure 1,³ the program exhibits a great deal of concurrency.
Making message-passing implicit in the language simplifies programming and makes it easier to describe fine-grain
concurrency.
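For readers unfamiliar with CST, the leaf-counting example can be approximated in Python. The sketch below is an analogy only. Python evaluates the two recursive calls one after the other, whereas in CST each call is a message that starts a new task. The `invocation_profile` helper is a simplification that counts how many count-elements invocations would be active in each message interval on the way down the tree, ignoring reply messages, so it is not the simulator mentioned in the footnote. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_tree_node(self):
        return self.left is not None  # leaves have no children

def count_elements(node):
    """Number of leaves below node, by double recursion."""
    if node.is_tree_node:
        return count_elements(node.left) + count_elements(node.right)
    return 1

def invocation_profile(root):
    """Invocations started in each message interval if all sends are concurrent."""
    profile, frontier = [], [root]
    while frontier:
        profile.append(len(frontier))
        frontier = [child for node in frontier if node.is_tree_node
                    for child in (node.left, node.right)]
    return profile

def balanced(depth):
    """Complete binary tree with 2**depth leaves."""
    if depth == 0:
        return Node()
    return Node(balanced(depth - 1), balanced(depth - 1))

if __name__ == "__main__":
    tree = balanced(3)
    print(count_elements(tree), invocation_profile(tree))
```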

CST  facilitates  the  construction  of  concurrency  abstractions  by  providing  distributed  objects:  objects  with  a  single 
name  whose  state  is  distributed  across  the  nodes  of  a  concurrent  computer.  The  one-to-many  naming  of  distributed 
objects  along  with  their  ability  to  process  many  messages  simultaneously  allows  them  to  efficiently  connect  together 

¹ The research described in this paper was supported in part by the Defense Advanced Research Projects Agency and monitored
by the Office of Naval Research under contracts N00014-88K-0738, N00014-87K-0825, and N00014-85-K-0124, in part by a National
Science Foundation Presidential Young Investigator Award with matching funds from General Electric Corporation, an Analog Devices
Fellowship, and an ONR Fellowship.

² This program is in prefix CST, a dialect that has a syntax resembling LISP. Infix CST [5] has a syntax closer to that of Smalltalk-80.
³ The concurrency profiles presented in this paper are produced by an lcode level simulation of CST programs.
(class node (object) left right tree-node?)

(method node count-elements () ()
  (if tree-node? (+ (count-elements left)
                    (count-elements right))
      1))


Figure 1: A CST program that calculates the number of leaves in a tree using double recursion. Its concurrency
profile (active tasks in each message interval) is shown to the right.


large  numbers  of  objects.  Distributing  the  name  of  a  single  distributed  queue  to  sets  of  producer  and  consumer 
objects,  for  example,  connects  many  producers  to  many  consumers  without  a  bottleneck. 

The Optimist compiler [8] compiles Concurrent Smalltalk to the assembly language of the Message-Driven Processor
(MDP) [9]. It includes many standard optimizations such as register variable assignment, dataflow analysis, copy
propagation, and dead code elimination [2, 13] that are used in compilers for conventional processors. Due to the
fine-grained parallel nature of the J-machine, compiling for the MDP is unlike compiling for most conventional processors
in a few important aspects. For instance, loops are not important⁴, while minimizing code size, tail forwarding
methods, and efficiently and seamlessly handling parallelism are extremely important.

The development of Concurrent Smalltalk was motivated by dissatisfaction with process-based concurrent programming
using sends and receives [11]. Many of the ideas have been borrowed from actor languages [1]. Another language
named Concurrent Smalltalk has been developed at Keio University in Japan [14]. This language also allows message
sending to be asynchronous, but does not include the ability to describe distributed objects.


Concurrent  Smalltalk 

Top-Level  Expressions 

A CST program consists of a number of top-level expressions. Top-level forms include declarations of program
and data as well as executable expressions. Linking of programs (the resolution from selectors to methods) is done
dynamically.

<top-exp> := (Global <global-variable> <value>) |
             (Constant <constant-name> <value>) |
             (Class <class-name> (<superclasses>) <instance-vars>) |
             (Method <class-name> <method-name>
                     (<formals>) (<locals>)
                     <expressions>) |
             <expression>


Globals and Constants  Global and constant declarations define names in the environment. These names are
visible in all programs, unless shadowed by an instance, argument, or local variable name. The global declaration
simply defines the name. Its value remains unbound. The constant declaration defines the name and binds the name
to the specified value.


4 In fact, the current version of Concurrent Smalltalk does not even have loops.

Classes  Objects are defined by specifying classes. Objects of a particular class have the same instance variables
and understand the same messages. A class may inherit variables and methods from one or more superclasses. For
example:


(class node (object) left right tree-node?)

defines a class, node, that inherits the properties of class object and adds three instance variables. This means
that methods for the class node can access all the instance variables of class object as well as those defined in their
own class definition. Methods defined for class object are also inherited. Of course, this inheritance is transitive,
so node actually inherits from a series of classes up through the top of the class hierarchy. Instance variables in the
class definition may hide (shadow) those defined in the superclasses if they have the same name. The same kind of
shadowing is allowed for selectors (method names).


Methods  The  behavior  of  a  class  of  objects  is  defined  in  terms  of  the  messages  they  understand.  For  each  message, 
a  method  is  executed.  That  execution  may  send  additional  messages,  modify  the  object  state,  modify  the  object 
behavior,  and  create  new  objects.  Methods  consist  of  a  header  and  a  body.  The  header  specifies  class,  selector, 
arguments,  and  locals.  The  body  consists  of  one  or  more  expressions.  For  example: 


(method node count-elements () ()
  (if tree-node? (+ (count-elements left)
                    (count-elements right))
      1))


defines a method for class node with selector count-elements. The two empty lists indicate that there are no explicit
arguments and no local variables. If present, the keyword reply sends the result of the following expression back
to the sender of the count-elements message. In this case, there is no reply keyword, so the method replies with
the value of the last expression. If the programmer wishes to suppress the reply, he can use the (exit) form which
causes the method to terminate without a reply.

Messages are sent implicitly. Every expression conceptually involves sending a message to an object. Of course,
commonly occurring special cases, like adding two local integers, will be optimized to eliminate the send. For
example, (count-elements left) sends the message count-elements to left. (+ x y) sends the message + with
argument y to object x. If both x and y are local integers, this operation can be optimized as an add instruction.

Each  expression  consists  of  a  selector,  a  receiver,  and  zero  or  more  arguments.  Identifiers  must  be  one  of:  constant, 
global  variable,  argument,  local  variable,  or  instance  variable.  Subexpressions  may  be  executed  concurrently  and  are 
sequenced  only  by  data  dependence.  For  example,  in  the  following  expression  from  the  program  in  Figure  1 


(+ (count-elements left) (count-elements right))


the two count-elements messages will be sent concurrently and the + message will be sent when both replies have
been received. The only way to serialize subexpression evaluation is to assign intermediate results to local variables.
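
The evaluation rule above can be illustrated outside CST. The following is a minimal Python sketch, not the CST implementation: it uses asyncio coroutines as stand-ins for messages, so the two recursive calls are both issued before either reply is awaited, and the addition runs only once both replies have arrived. The names Node, full_tree, and count_leaves are assumptions made for this illustration, and asyncio interleaves tasks on one thread rather than running them on separate nodes.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Stand-in for the CST `node` class of Figure 1."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    tree_node: bool = False  # mirrors the tree-node? instance variable


async def count_elements(node: Node) -> int:
    """Count leaves; both recursive calls are issued before either is awaited."""
    if not node.tree_node:
        return 1
    # gather() starts both subexpressions and resumes only after both have
    # replied, which is the point at which the + can be performed.
    left_count, right_count = await asyncio.gather(
        count_elements(node.left), count_elements(node.right)
    )
    return left_count + right_count


def full_tree(depth: int) -> Node:
    """Build a complete binary tree with 2**depth leaves (hypothetical helper)."""
    if depth == 0:
        return Node()
    return Node(full_tree(depth - 1), full_tree(depth - 1), tree_node=True)


def count_leaves(depth: int) -> int:
    """Run the concurrent count on a complete tree of the given depth."""
    return asyncio.run(count_elements(full_tree(depth)))
```

Assigning the first count to a variable and awaiting it before issuing the second call would serialize the two subexpressions, which mirrors the use of local variables for sequencing described above.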

A complete list of CST expressions is shown below:

<exps> := <exp>+
<exp>  := <name> |
          (<selector> <receiver-exp> <argument-exp>*) |
          (send <selector-exp> <receiver-exp> <argument-exp>*) |
          (value <exp>) |
          (set <name> <exp>) |
          (cset <name> <exp>) |
          (msg <node> <selector> <receiver> <actuals>) |
          (forward <continuation> <selector> <receiver> <args>) |
          (reply <exp>) |
          (block (<formals>) (<locals>) <exps>) |
          (if <exp> <exp> <exp>) |
          (begin <exps>) |
          (exit)


An  Example  CST  Program 


We now introduce a slightly more complicated version of the program shown in Figure 1. Rather than simply counting
the leaves on a tree, we compute the lengths of all the lists linked to the tree and sum those lengths together.


(class node (object) left right)

(method node count-list-elements () ()
  (+ (count-list-elements left)
     (count-list-elements right)))

(class pair (object) car cdr)

(method pair count-list-elements () ()
  (length right 0))

(method pair length (n) ()
  (if (eq cdr 'nil) (+ 1 n)
      (length cdr (+ 1 n))))


Figure 2: A CST program that computes the sum of list lengths and its execution profile


The node class definition is the same as it was in Figure 1. left and right are the children of the current node
in a binary tree. The right of each leaf node points to a linked list of pairs. The method count-list-elements
recursively counts the list lengths by doing so for the right subtree and the left subtree concurrently. At the bottom
of the tree, the late binding SEND operation causes the count-list-elements method for pairs to be invoked. This
method computes the length of each list using the tail recursive method length.
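
The role of late binding in this program can be sketched in Python, where method lookup is also resolved at run time from the class of the receiver. This is an illustrative sketch only: it simplifies the tree so that a child is directly either a Node or a Pair, it assumes the pair method measures the list that starts at the receiving pair, and the helper make_list is hypothetical.

```python
class Node:
    """Interior tree node; children may be Nodes or Pairs."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def count_list_elements(self):
        # The same selector is sent to both children; which method runs
        # depends on the class of each child (late binding).
        return self.left.count_list_elements() + self.right.count_list_elements()


class Pair:
    """Linked-list cell with car and cdr, as in Figure 2."""

    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr

    def count_list_elements(self):
        # Assumption: count the list beginning at this pair.
        return self.length(0)

    def length(self, n):
        # Mirrors the tail-recursive CST method: the accumulator n carries
        # the number of cells seen so far.
        if self.cdr is None:
            return n + 1
        return self.cdr.length(n + 1)


def make_list(k):
    """Build a linked list of k pairs (hypothetical helper, k >= 1)."""
    head = None
    for i in range(k):
        head = Pair(i, head)
    return head
```

A tree such as Node(make_list(2), Node(make_list(3), make_list(1))) then yields the sum of the three list lengths from a single count_list_elements call at the root.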


Distributed  Objects 

CST programs exhibit parallelism between objects, that is, many objects may be actively processing messages
simultaneously. However, ordinary objects can only receive one message at a time. CST relaxes this restriction with
Distributed Objects (DOs). Distributed objects are made up of multiple representatives (constituent objects) that
can each accept messages independently. The distributed object has a name (Distributed object ID or DID) and all
other objects send messages to this name when they wish to use the DO.

Messages  sent  to  the  DO  are  received  by  one  and  only  one  constituent  object  (CO).  Which  constituent  receives 
the  message  is  left  unspecified  in  the  language.  A  clever  implementation  might  send  the  messages  to  the  closest 
constituent  whereas  a  simpler  implementation  might  send  the  messages  to  a  random  constituent.  The  state  of  a 
distributed  object  is  typically  distributed  over  the  constituents.  This  means  that  responding  to  an  external  request 
often  requires  the  passing  of  messages  amongst  the  constituents  before  replying.  No  locking  is  performed  on  the 
distributed  object  as  a  whole.  This  means  that  the  programmer  must  ensure  the  consistency  of  the  distributed 
object. 
Support for Distributed Objects


CST includes two constructs to support distributed objects. For DO creation, we add an argument to the new selector:
the number of constituents desired in this DO. In order to pass messages within the object, each constituent object
must be able to address each of the other constituents. This is implemented with the special selector co. Each
distributed object can use this selector, the special instance variable group (a reference to the DO), and an index
to address any constituent. For example, (co group 5) refers to the 5th constituent of a distributed object. Each
constituent also has access to its own index and the number of constituents in the entire distributed object. Thus a
description of a distributed object might look something like the example shown in Figure 3.


;; Distributed Array Abstraction.
;; The constituents are spread throughout the machine.
;; The array state is allocated into equal sized chunks on the constituents.
(class distarray (distobj) nr-elts chunk-size elt-array)

;; Given an uninitialized DO, init makes each one an array,
;; tells it how many elts it has, and how many elements
;; are in the entire array.
(method distarray init (arr-size) ()
  (do-1 self (block (constit elts) ()
               (co-init (co (group constit) (myindex constit)) elts)
               (reply constit))
        arr-size))

;; helper for init
(method distarray co-init (elts) ()
  (begin (set chunk-size (/ elts (+ 1 maxindex)))
         (set nr-elts elts)
         (set elt-array (new array chunk-size))
  ))

;; Tree recursive apply, with one argument
(method distarray do-1 (ablock arg1) () (ldo-1 (co group 0) ablock arg1))
(method distarray ldo-1 (ablock arg1) (a b lindex rindex)
  (set lindex (lindex self))
  (set rindex (rindex self))
  (cset a (if (<= lindex maxindex) (ldo-1 (co group lindex) ablock arg1)
              '()))
  (cset b (if (<= rindex maxindex) (ldo-1 (co group rindex) ablock arg1)
              '()))
  (touch a b)
  (reply (value ablock self arg1))
  (exit))

;; Select array element at index
(method distarray at (index) (selector)
  (if (or (< index (* chunk-size myindex))
          (>= index (* chunk-size (+ myindex 1))))
      (begin (set selector (truncate (/ index chunk-size)))
             (forward requester at (co group selector) index)
             (exit))
      (at elt-array (mod index chunk-size))))

;; Set array element at index to value
(method distarray at.put (index value) (selector)
  (if (or (< index (* chunk-size myindex))
          (>= index (* chunk-size (+ myindex 1))))
      (begin (set selector (truncate (/ index chunk-size)))
             (forward requester at.put (co group selector) index value)
             (exit))
      (at.put elt-array (mod index chunk-size) value)))

;; to make a distarray of 256 constituents and 1024 elements do:
;;
;; (init (new distarray 256) 1024)
;;

Figure  3:  A  Distributed  Array  Example 


In the example of the distributed array, we would create a usable array in two steps. First we construct the
distributed object using the new form. The example in Figure 3 creates a distributed object with 256 constituents.
After the DO is created, we must initialize it in a way that is appropriate for the distributed array. We do so by sending
it an init message (also defined in Figure 3). This initialization sets each constituent up with a private array of the
appropriate number of elements. For example, if we wanted a distarray of 512 elements, in this case each constituent
would have a private array of two elements. This initialization is done in a tree recursive fashion and therefore takes
O(lg n) time.

The mapping of the distarray elements onto the private arrays is done by the at and at.put methods. Each
constituent is responsible for a contiguous range of the distarray elements. Any requests received by a constituent
are first checked to see if they are within the local CO's jurisdiction. If they are not, they are forwarded to the
appropriate CO. If they are, the request is handled locally. This is a particularly simple example because each
constituent is wholly responsible for its subrange and need not negotiate with other constituents before modifying
its local state.
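
The index mapping used by at and at.put can be made concrete with a small Python model. This is a sketch under stated assumptions, not the CST runtime: the constituents live in one Python list instead of on separate nodes, forwarding is an ordinary method call rather than a forwarded message, the element count is assumed to divide evenly among the constituents, and the names Constituent, make_distarray, and forwarded are invented for the illustration.

```python
class Constituent:
    """One constituent object (CO) holding a contiguous chunk of the array."""

    def __init__(self, group, my_index, chunk_size):
        self.group = group              # shared list of all constituents
        self.my_index = my_index        # this constituent's position
        self.chunk_size = chunk_size
        self.elt_array = [None] * chunk_size
        self.forwarded = 0              # bookkeeping for the illustration only

    def _is_local(self, index):
        low = self.chunk_size * self.my_index
        return low <= index < low + self.chunk_size

    def _owner(self, index):
        # Same arithmetic as (truncate (/ index chunk-size)) in Figure 3.
        return self.group[index // self.chunk_size]

    def at(self, index):
        if not self._is_local(index):
            self.forwarded += 1
            return self._owner(index).at(index)
        return self.elt_array[index % self.chunk_size]

    def at_put(self, index, value):
        if not self._is_local(index):
            self.forwarded += 1
            return self._owner(index).at_put(index, value)
        self.elt_array[index % self.chunk_size] = value
        return value


def make_distarray(n_constituents, n_elements):
    """Create the constituents and give each an equal-sized private array."""
    chunk_size = n_elements // n_constituents
    group = []
    for i in range(n_constituents):
        group.append(Constituent(group, i, chunk_size))
    return group
```

With 256 constituents and 512 elements each private array has two elements, and a request for index 511 that arrives at constituent 0 is forwarded once, to constituent 255, where it is handled locally.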

Distributed objects are of great utility in building large objects on fine-grain machines. In the J-machine, we restrict
ordinary objects to fit within the memory of a single node, thus restricting object size. With distributed objects, we
only require that a constituent of the DO fit on a single node. Some useful examples for distributed objects are
dictionaries, distributed arrays, sets, queues, and priority queues.


Experience  with  CST 


We have written a large number of Concurrent Smalltalk programs and executed them on our Icode simulator. These
programs include various data structures: distributed arrays, sets, rings, B-trees, grids, and matrices. They also
include several application kernels: N-body interaction and charged particle transport (Particle-in-cell algorithm).
To date, the programs studied range from toys to applications of over 1000 lines. It is clear from our experience
that CST programs exhibit large amounts of parallelism. However, we are just beginning to exploit the potential of
Distributed Objects as building blocks for concurrent programs. We will continue to study data structures, algorithms
and full-blown applications in our continuing evaluation of Concurrent Smalltalk.


The  Optimist  Compiler  for  CST 


Goals 

The main goal of the Optimist compiler is to produce Concurrent Smalltalk code that is as small as possible without
sacrificing speed. In almost all cases optimizations that reduce space also improve speed, but there are a few cases in
which they conflict; in those cases the decisions were made in favor of optimizing space. Compilation speed was not
a major goal of the compiler project; simplicity and flexibility were considered more important. Still, the compiler
does achieve reasonable compilation speed, taking between one and fifteen seconds to compile most methods on a
2-megabyte Macintosh5 II using Coral Software’s Allegro Common Lisp.


Organization 

The Optimist compiler consists of four phases, as shown in Figure 4. The Concurrent Smalltalk Front End can
be replaced by other front ends to compile other languages for the MDP. Also, the Icode can be extracted from two
places in the compilation process and either compiled onto different hardware or run on an Icode simulator.

The  source  code  is  converted  by  the  Front  End  into  an  intermediate  language  called  Icode.  The  Icode  is  at  a 
somewhat  higher  level  than  the  triples  or  quadruples  codes  that  most  compilers  use,  in  that  it  specifies  units  such 
as  entire  procedure  calls  in  single  instructions.  The  Icode  also  allows  for  the  possibility  of  having  more  than  one 
source  language  compile  into  MDP  assembly  language  code  or  having  the  same  source  language  compile  into  several 
assembly  languages.  Figure  5  shows  the  length  method  in  its  Icode  form. 

5Macintosh is a trademark of Apple Computer, Inc.

[Figure 4 diagram not reproduced. Recoverable labels: CST Source Code; Front End; Other Front Ends; Icode; Icode Simulator (at two extraction points); MDP Assembly Code.]

Figure  4:  Compiler  Organization. 


(CSEND (TEMP 0) (METHOD EQ) (IVAR 1) (CONST NIL))
(FALSEJUMP (TEMP 0) 0)
(CSEND (TEMP 1) (METHOD +) (CONST 1) (ARG 0))
(JUMP 1)
(LABEL 0)
(CSEND (TEMP 2) (METHOD +) (CONST 1) (ARG 0))
(CSEND (TEMP 1) (METHOD LENGTH) (IVAR 1) (TEMP 2))
(LABEL 1)
(RETURN (TEMP 1))


Figure 5: Icode for the length Method: The Icode output by the Front End is a literal translation of the source
code with few optimizations. At this point all method calls, including primitives, are compiled as CSENDs.


The Statement Analyzer and Optimizer processes and optimizes the Icode generated by the Front End. It performs
all of the compiler’s optimizations that are relevant at the Icode level of abstraction. Internally it works with Icode
in the form of a directed control-flow graph. These optimizations include dead code elimination, move elimination,
dataflow transformations, constant folding, tail forwarding, and merging of identical statements on both
paths of a conditional. The optimizations are repeatedly attempted until none of them can improve the code.
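
The "repeat until nothing improves" strategy is a fixed-point iteration over a set of passes. The Python sketch below shows the driver loop with two toy passes on straight-line code. It is an illustration only: the tuple format, the opcode names CONST, ADD, MOVE, and RETURN, and the pass implementations are assumptions made for this example, whereas the Optimist works on a control-flow graph of real Icode.

```python
def fold_constants(code):
    """Replace ADDs whose operands are both known constants by a CONST."""
    known, out = {}, []
    for stmt in code:
        op = stmt[0]
        if op == "CONST":
            known[stmt[1]] = stmt[2]
            out.append(stmt)
        elif op == "ADD" and stmt[2] in known and stmt[3] in known:
            value = known[stmt[2]] + known[stmt[3]]
            known[stmt[1]] = value
            out.append(("CONST", stmt[1], value))
        else:
            if op != "RETURN":
                known.pop(stmt[1], None)  # destination is no longer a known constant
            out.append(stmt)
    return out


def eliminate_dead_code(code):
    """Drop assignments whose destination is never used afterwards."""
    live, kept = set(), []
    for stmt in reversed(code):
        if stmt[0] == "RETURN":
            live.add(stmt[1])
            kept.append(stmt)
        elif stmt[1] in live:
            live.discard(stmt[1])
            # Only variable names (strings) become live; literals do not.
            live.update(s for s in stmt[2:] if isinstance(s, str))
            kept.append(stmt)
    kept.reverse()
    return kept


def optimize(code, passes=(fold_constants, eliminate_dead_code)):
    """Apply every pass repeatedly until a full round changes nothing."""
    changed = True
    while changed:
        changed = False
        for run_pass in passes:
            new_code = run_pass(code)
            if new_code != code:
                code, changed = new_code, True
    return code


EXAMPLE = [
    ("CONST", "t0", 1),
    ("CONST", "t1", 2),
    ("ADD", "t2", "t0", "t1"),
    ("ADD", "t3", "t2", "t2"),
    ("RETURN", "t3"),
]
```

On EXAMPLE the folding pass turns both ADDs into constants, which leaves t0, t1, and t2 unused, so the dead-code pass can then remove them; a second round finds nothing further to do and the loop stops.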

The Instruction Generator compiles each Icode statement to a number of quasi-MDP instructions and outputs the
MDP code in the form of a directed control-flow graph. At the same time, the Instruction Generator assigns variables
to either registers or memory locations and performs statement-specific optimizations on the Icode.

The Assembly Code Generator inserts branches into the directed graph of quasi-MDP instructions created by the
Instruction Generator and performs several peephole optimizations. The important optimizations include shifting
instructions wherever possible to align DC (Load Constant) instructions to word boundaries (all other instructions
need only be aligned at half-word boundaries) and combining SEND and SENDE instructions into SEND2 and
SEND2E. The Assembly Code Generator replaces short branches by long ones where necessary; such replacements
are complicated by the fact that long branches alter the value of the MDP’s register R0. The Assembly Code Generator
outputs a file of assembly language statements which can be read, assembled, and executed by our MDP simulator
MDPSim [10]. Figure 6 contains the assembly code output for the sample method length.


        MODULE  PAIR_LENGTH
        DC      MSG:LoadCodeM8
        DC      {Class.PAIR}.{Method.LENGTH}
        MOVE    [2,A3],R0            ; 0
        XLATE   R0,A2,XLATE.OBJ      ; 0.5
        MOVE    1,R3                 ; 1
        ADD     R3,[3,A3],R2         ; 1.5
        MOVE    [3,A2],R1            ; 2
        BNNIL   R1,^L001             ; 2.5
        MOVE    [4,A3],R1            ; 3
        BNIL    R1,^L002             ; 3.5
        DC      MSG:ReplyConst-M     ; 4
        WTAG    R1,1,R3              ; 5
        LSH     R3,-16,R3            ; 5.5
        SEND2   R3,R0                ; 6
        SEND    R1                   ; 6.5
        SEND2E  [5,A3],R2            ; 7
        BR      ^L002                ; 7.5
L001:   MOVE    [3,A2],R0            ; 8
        CALL    Send_Node.Br         ; 8.5
        DC      MSG:SendConst+7      ; 9
        SEND2   R1,R0                ; 10
        DC      {Method.LENGTH}      ; 11
        SEND    R0                   ; 12
        SEND2   [3,A2],R2            ; 12.5
        SEND    [4,A3]               ; 13
        SENDE   [5,A3]               ; 13.5
L002:   SUSPEND                      ; 14
        END


Figure  6:  Final  Output  of  the  Compiler:  This  is  the  MDP  assembly  code  into  which  the  length  method  compiles.  If 
the  optimizations  were  turned  off,  the  code  size  would  have  been  32  words,  more  than  twice  the  size  of  the  optimized 
code. 


Optimizations 

Tail Forwarder  The tail forwarder performs the message-passing equivalent of tail recursion. It is often the case
that the value returned by a Concurrent Smalltalk method is the value returned by the last statement of that method,
and that statement is often a method call. An example of this phenomenon is the recursive definition of the length
function in Figure 2.

If cdr is not equal to nil, the length method makes a recursive call and when that call returns, it immediately
returns that value as the result. There is, however, no fundamental reason why length should wait for the result
of the recursive call to length only to return it to the caller; on the contrary, it would be better if the recursive
length call returned its result to the initial caller. Optimized this way, length runs in constant space instead of space
proportional to the list length. The Tail Forwarder performs this optimization by looking for a CSEND statement
whose value is returned by a REPLY statement immediately afterwards. Such a CSEND statement is modified to
inform the callee to return its result to this method’s caller instead of this method.
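
The pattern match described above can be sketched as a small rewrite over a list of statements. The Python below is a simplified illustration, not the Optimist's code: statements are tuples in straight-line order, REPLY stands for the statement that returns a value, and the FORWARD opcode is a hypothetical name for "send this message and have the callee reply to my caller".

```python
def tail_forward(code):
    """Rewrite a CSEND that is immediately replied into a forwarding send.

    Statement shapes assumed for this sketch:
      ("CSEND", dest, selector, receiver, *args)
      ("REPLY", src)
    """
    out = []
    i = 0
    while i < len(code):
        stmt = code[i]
        following = code[i + 1] if i + 1 < len(code) else None
        if stmt[0] == "CSEND" and following == ("REPLY", stmt[1]):
            # The result is only passed straight back, so let the callee
            # answer our caller directly and drop both the wait and the REPLY.
            out.append(("FORWARD",) + stmt[2:])
            i += 2
        else:
            out.append(stmt)
            i += 1
    return out


# The recursive branch of the length method, in the toy format.
LENGTH_TAIL = [
    ("CSEND", "t2", "+", 1, "n"),
    ("CSEND", "t1", "length", "cdr", "t2"),
    ("REPLY", "t1"),
]
```

Applied to LENGTH_TAIL, the rewrite leaves the addition alone and replaces the final send and reply by one forwarding send, so the method no longer needs to stay around waiting for the recursive result.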


Fork and Join Mergers  These two optimizations, if they can be applied, often produce significant savings in the
output code size. They try to consolidate similar statements on both sides of forks (conditionals) and joins (places
where two paths of control flow merge) in the control-flow graph.

The  Join  Merger  looks  for  similar  statements  immediately  preceding  each  join  in  the  control-flow  graph.  Here  two 
statements  are  considered  to  be  similar  if  they  are  identical  or  if  they  are  both  CSENDs  with  identical  targets  and 
the  same  number  of  arguments;  the  arguments  themselves  need  not  be  the  same.  The  Join  Merger  moves  both 
statements  after  the  join;  if  the  statements  were  not  identical,  MOVEs  are  generated  to  copy  any  differing  arguments 
into  temporaries  before  the  join;  the  combined  statement  after  the  join  will  use  the  temporaries  instead  of  the  original 
arguments.  These  MOVEs  are  usually  later  removed  by  the  Move  Eliminator.  Although  more  than  two  paths  of 
control  flow  can  join  at  the  same  place,  the  Join  Merger  only  considers  them  pairwise;  if  more  than  two  paths  can 
be  merged,  initially  two  will  be  merged,  with  the  other  ones  considered  in  a  later  pass.  The  Fork  Merger  operates 
analogously  except  that  it  also  has  to  be  sure  not  to  affect  the  value  of  the  condition  determining  which  branch  the 
program  will  take. 

The Join Merger occasionally merges two completely different method calls which happen to have the same number
of arguments, but which may even call different methods (the method selector is treated as an argument like any
other), a rather unexpected optimization indeed. In each branch just before the join, the resulting object code copies
the differing method arguments into the MDP’s registers and stores the appropriate method selector in a register.
After the join is common code that sends the message given the method selector and arguments in the registers. Since
the code to send a message is long compared to the code to load values into registers, the optimization has a net
savings of five words (ten instructions) of code without significantly affecting the running time.


Move  Eliminator  For  each  MOVE  statement  from  a  local  variable  to  another  local  variable,  the  Move  Eliminator 
attempts  to  merge  the  source  and  destination  variables  into  one  variable  and  then  remove  the  MOVE  statement. 
Such  a  merge  can  be  done  successfully  if  the  two  variables  are  never  simultaneously  live  at  any  point  in  the  code. 
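
The merge condition can be checked with a backward liveness scan. The following Python sketch handles straight-line code only, whereas the Move Eliminator works on a control-flow graph; the statement format (op, dest, sources...) and the function names are assumptions made for the illustration.

```python
def live_sets(code):
    """Return the set of live variables after each statement, plus on entry."""
    live = set()
    after = [None] * len(code)
    for i in range(len(code) - 1, -1, -1):
        after[i] = set(live)
        _, dest, *sources = code[i]
        live.discard(dest)        # a definition ends the earlier live range
        live.update(sources)      # uses make the operands live before this point
    return after, live


def can_merge(code, a, b):
    """True if a and b are never live at the same program point."""
    after, entry = live_sets(code)
    return not any(a in live and b in live for live in after + [entry])


def eliminate_moves(code):
    """Remove each MOVE whose two variables can share one name."""
    for i, stmt in enumerate(code):
        if stmt[0] == "MOVE" and can_merge(code, stmt[1], stmt[2]):
            dest, src = stmt[1], stmt[2]
            merged = [
                (s[0],) + tuple(src if v == dest else v for v in s[1:])
                for j, s in enumerate(code)
                if j != i
            ]
            return eliminate_moves(merged)
    return code


REMOVABLE = [("ADD", "b", "x", "y"), ("MOVE", "a", "b"), ("RETURN", None, "a")]
BLOCKED = [
    ("ADD", "b", "x", "y"),
    ("MOVE", "a", "b"),
    ("ADD", "c", "a", "b"),   # a and b are both live here, so no merge
    ("RETURN", None, "c"),
]
```

In REMOVABLE the source dies exactly where the destination is born, so the two names collapse into one and the MOVE disappears; in BLOCKED both variables are needed by the later ADD, so the code is left unchanged.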

The Move Eliminator complements the copy-propagation algorithm in the Optimist. Although both try to optimize
MOVE statements, each is able to handle cases that the other cannot. The copy propagation can handle constants,
while Figure 7 shows an example of MOVE statements that can be eliminated by the Move Eliminator but not by
copy propagation.


Figure 7: Move Eliminator Example: The Move Eliminator is able to remove the two MOVE statements (a←b) and
(a←c) in the above code (the arrows indicate possible flow of control paths). The copy propagation algorithm would
not detect the opportunity to remove these two MOVE statements because the value of a at the return statement is
neither a copy of b nor a copy of c. The above code does occur in many methods.


Variable  Allocator  A  greedy  algorithm  is  used  to  assign  eligible  variables  to  registers.  The  shortest-lived  variables 
with  the  most  references  are  considered  first.  A  graph  coloring  algorithm  is  used  to  assign  the  variables  that  did  not 
fit  in  the  registers  to  context  slots;  thus,  fewer  context  slots  are  used,  saving  valuable  memory  space. 
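
A two-stage allocator of this kind can be sketched in Python as follows. This is a rough model under several assumptions, not the Optimist's algorithm: each variable is summarized by a single live interval and a reference count, "shortest-lived with the most references first" is interpreted as sorting by lifetime and then by descending reference count, and the coloring step is a simple greedy coloring of the interference graph of the spilled variables.

```python
from typing import NamedTuple


class Var(NamedTuple):
    name: str
    start: int   # first program point at which the variable is live
    end: int     # last program point at which the variable is live
    refs: int    # number of references in the method


def overlaps(a, b):
    """Two variables interfere if their live intervals intersect."""
    return a.start <= b.end and b.start <= a.end


def allocate(variables, n_registers):
    """Return (register assignment, context-slot assignment) by name."""
    order = sorted(variables, key=lambda v: (v.end - v.start, -v.refs))
    registers = [[] for _ in range(n_registers)]
    reg_of, spilled = {}, []
    for v in order:
        for r, occupants in enumerate(registers):
            if not any(overlaps(v, o) for o in occupants):
                occupants.append(v)
                reg_of[v.name] = r
                break
        else:
            spilled.append(v)   # no register is free for the whole lifetime

    # Greedy coloring: give each spilled variable the lowest slot that no
    # interfering, already-placed variable is using.
    slot_of = {}
    for v in spilled:
        taken = {
            slot_of[o.name]
            for o in spilled
            if o.name in slot_of and overlaps(v, o)
        }
        slot = 0
        while slot in taken:
            slot += 1
        slot_of[v.name] = slot
    return reg_of, slot_of


EXAMPLE_VARS = [
    Var("a", 0, 1, 3), Var("b", 0, 5, 2), Var("c", 2, 3, 1),
    Var("d", 4, 9, 1), Var("e", 6, 8, 1), Var("g", 7, 12, 1),
]
```

With a single register, the three short-lived variables in EXAMPLE_VARS share it in turn, and the three longer-lived ones need only two context slots because b and g are never live at the same time.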
Summary


In this paper, we have presented a new language, Concurrent Smalltalk, that is designed for concurrency. Specific
support for concurrency includes locks, distributed objects, and asynchronous message passing.

Distributed  Objects  represent  a  significant  innovation  in  programming  parallel  machines.  We  refer  to  the  constituents 
of  a  distributed  object  with  a  single  name,  but  the  implementation  of  the  object  is  with  many  constituents.  This 
different  perspective  allows  easy  use  of  distributed  objects  by  outside  programs  while  allowing  the  exploitation  of 
internal  concurrency. 

We have described an implementation of a CST system. This programming environment includes a compiler,
simulator, and statistics collection package. This set of tools allows us to experiment with new constructs and
implementation techniques for the language. Although many of the optimizations used by the Optimist compiler are generally
known,  they  have  usually  been  applied  to  compilers  for  conventional  processors.  The  issues  involved  in  compiling 
for  the  MDP  are  quite  different  from  compiling  for  conventional  processors.  After  examining  the  compiler’s  output, 
it  becomes  apparent  that  the  optimizations  are  essential  to  the  successful  use  of  Concurrent  Smalltalk  on  the  MDP. 
The  compiler’s  optimizations  reduce  the  amount  of  code  output  by  anywhere  between  20%  and  60%  (or  even  more 
in  some  cases)  compared  to  output  with  all  nonessential  optimizations  disabled.  Such  a  reduction  is  very  important 
on  a  processor  with  only  4096  words  of  primary  memory. 

There are many open issues relating to CST and similar programming systems. Key efficiency issues remain
unresolved: how fine grain will the programs written in CST be and what is the run time overhead of CST programs?
There are also concerns about the expressive power of languages like CST: how easy is it to write programs in CST
and how useful are distributed objects?
Abstract

In  the  analog  VLSI  implementation  of  neural  systems,  it  is  sometimes  convenient  to  build 
lateral  inhibition  networks  by  using  a  locally  connected  on-chip  resistive  grid.  A  serious 
problem  of  unwanted  spontaneous  oscillation  often  arises  with  these  circuits  and  renders 
them  unusable  in  practice.  This  paper  reports  a  design  approach  that  guarantees  such  a 
system  will  be  stable,  even  though  the  values  of  designed  elements  in  the  resistive  grid  may 
be  imprecise  and  the  location  and  values  of  parasitic  elements  may  be  unknown.  The 
method  is  based  on  a  mathematical  analysis  using  Tellegen’s  theorem  and  the  Popov 
criterion.  The  criteria  are  local  in  the  sense  that  no  overall  analysis  of  the  interconnected 
system  is  required  for  their  use,  empirical  in  the  sense  that  they  involve  only  measurable 
frequency  response  data  on  the  individual  cells,  and  robust  in  the  sense  that  they  are  not 
affected  by  unmodelled  parasitic  resistances  and  capacitances  in  the  interconnect  network. 
I.  Introduction 

The term “lateral inhibition” first arose in neurophysiology
to describe a common form of neural circuitry in
which the output of each neuron in some population is
used to inhibit the response of each of its neighbors. Perhaps
the best understood example is the horizontal cell
layer in the vertebrate retina, in which lateral inhibition
simultaneously enhances intensity edges and acts as an
automatic gain control to extend the dynamic range of
the retina as a whole [1]. The principle has been used
in the design of artificial neural system algorithms by
Kohonen [2] and others and in the electronic design of
neural chips by Carver Mead et al. [3,4].

In the VLSI implementation of neural systems, it is
convenient to build lateral inhibition networks by using
a locally connected on-chip resistive grid. Linear resistors
fabricated in, e.g., polysilicon, could yield a very
compact realization, and nonlinear resistive grids, made
from MOS transistors, have been found useful for image
segmentation [4,5]. Networks of this type can be divided
into two classes: feedback systems and feedforward-only
systems. In the feedforward case one set of amplifiers
imposes signal voltages or currents on the grid and another
set reads out the resulting response for subsequent
processing, while the same amplifiers both “write to” the
grid and “read from” it in a feedback arrangement.
Feedforward networks of this type are inherently stable, but
feedback networks need not be.

Figure 1: This photoreceptor and signal processor circuit,
using two MOS amplifiers, realizes lateral inhibition
by communicating with similar cells through a resistive
grid.

A  practical  example  is  one  of  Carver  Mead’s  retina 
chips  [3]  that  achieves  edge  enhancement  by  means  of  lat¬ 
eral  inhibition  through  a  resistive  grid.  Figure  1  shows  a 
single  cell  in  a  continuous-time  version  of  this  chip,  and 
Fig.  2  illustrates  the  network  of  interconnected  cells. 
Note  that  the  voltage  on  the  capacitor  in  any  given  cell 
is  affected  both  by  the  local  light  intensity  incident  on 
that  cell  and  by  the  capacitor  voltages  on  neighboring 
cells  of  identical  design.  Each  cell  drives  its  neighbors, 
which  drive  both  their  distant  neighbors  and  the  original 
cell  in  turn.  Thus  the  necessary  ingredients  for  instabil¬ 
ity  —  active  elements  and  signal  feedback  —  are  both 
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Figure  2:  Interconnection  of  cells  through  a  hexagonal  re¬ 
sistive  grid.  Cells  are  drawn  as  2-terminal  elements  with 
the  power  supply  and  signal  output  lines  suppressed.  The 
grid  resistors  will  be  nonlinear  by  design  in  many  such 
circuits. 

present  in  this  system.  Experiment  has  shown  that  the 
individual  cells  in  this  system  are  open-circuit  stable  and 
remain  stable  when  the  output  of  amp  #  2  is  connected 
to  a  voltage  source  through  a  resistor,  but  the  intercon¬ 
nected  system  oscillates  so  badly  that  the  original  design 
is  essentially  unusable  in  practice  with  the  lateral  inhibi¬ 
tion  paths  enabled  [6].  Such  oscillations  can  readily  occur 
in  most  resistive  grid  circuits  with  active  elements  and 
feedback,  even  when  each  individual  cell  is  quite  stable. 
Analysis  of  the  conditions  of  instability  by  conventional 
methods  appears  hopeless,  since  the  number  of  simulta¬ 
neously  active  feedback  loops  is  enormous. 

This  paper  reports  a  practical  design  approach  that 
rigorously  guarantees  such  a  system  will  be  stable.  The 
work begins with the naive observation that the system
would  be  stable  if  we  could  design  each  individual  cell 
so  that,  although  internally  active,  it  acts  like  a  passive 
system  as  seen  from  the  resistive  grid.  The  design  goal 
in  that  case  would  be  that  each  cell’s  output  impedance 
should  be  a  positive-real  [7,8,  and  9,  p.  174]  function. 
This  is  sometimes  possible  in  practice;  we  will  show  that 
the  original  network  in  Fig.  1  satisfies  this  condition  in 
the  absence  of  certain  parasitic  elements.  Furthermore,  it 
is  a  condition  one  can  verify  experimentally  by  frequency- 
response  measurements. 

It  is  obvious  that  a  collection  of  cells  that  appear  pas¬ 
sive  at  their  terminals  will  form  a  stable  system  when 
interconnected  through  a  passive  medium  such  as  a  re¬ 
sistive  grid,  and  that  the  stability  of  such  a  system  is  ro¬ 
bust  to  perturbations  by  passive  parasitic  elements  in  the 
network.  The  contribution  of  this  paper  is  to  go  beyond 
that  observation  to  provide  i)  a  demonstration  that  the 
passivity or positive-real condition is much stronger than
we  actually  need  and  that  weaker  conditions,  more  easily 


Figure  3:  Elementary  model  for  an  MOS  amplifier. 
These  amplifiers  have  a  relatively  high  output  resistance, 
which  is  determined  by  a  bias  setting  (not  shown). 

achieved  in  practice,  suffice  to  guarantee  robust  stabil¬ 
ity  of  the  linear  network  model,  and  ii)  an  extension 
of  the  analysis  to  the  nonlinear  domain  that  furthermore 
rules  out  sustained  large-signal  oscillations  under  certain 
conditions. 

Note  that  the  work  reported  here  does  not  apply  di¬ 
rectly  to  networks  created  by  interconnecting  neuron-like 
elements,  as  conventionally  described  in  the  literature  on 
artificial  neural  systems,  through  a  resistive  grid.  The 
“neurons”  in,  e.g.,  a  Hopfield  network  [10]  are  unilateral 
2-port  elements  in  which  the  input  and  output  are  both 
voltage  signals.  The  input  voltage  uniquely  and  instan¬ 
taneously  determines  the  output  voltage  of  such  a  neuron 
model,  but  the  output  can  only  affect  the  input  via  the 
resistive grid. In contrast, the cells in our system are
1-port electrical elements (temporarily ignoring the optical
input  channel)  in  which  the  port  voltage  and  port  cur¬ 
rent  are  the  two  relevant  signals,  and  each  signal  affects 
the  other  through  the  cell’s  internal  dynamics  (modelled 
as  a  Thevenin  equivalent  impedance)  as  well  as  through 
the  grid’s  response. 

II.  The  Linear  Theory 

This  work  was  motivated  by  the  following  linear  anal¬ 
ysis  of  a  model  for  the  circuit  in  Fig.  1.  For  an  initial 
approximation  to  the  output  admittance  of  the  cell  we 
use  the  elementary  model  shown  in  Fig.  3  for  the  ampli¬ 
fiers  and  simplify  the  circuit  topology  within  a  single  cell 
(without  loss  of  relevant  information)  as  shown  in  Fig. 
4. 

Straightforward  calculations  show  that  the  output  ad¬ 
mittance  is 

$$Y(s) = g_{m2} + R_{o2}^{-1} + s C_{o2} + \frac{g_{m1}\, g_{m2}\, R_{o1}}{1 + s R_{o1} C_{o1}}, \qquad (1)$$

which  is  positive-real. 
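
As a quick numerical sanity check on the positive-real claim, the sketch below samples the real part of a cell admittance along the $j\omega$ axis. It assumes a two-amplifier form $Y(s) = g_{m2} + 1/R_{o2} + sC_{o2} + g_{m1}g_{m2}R_{o1}/(1 + sR_{o1}C_{o1})$ for the simplified cell; this form, the function names, and every element value are illustrative assumptions, not data taken from the chip. A nonnegative real part on the axis is only part of the positive-real test: the function must also be analytic in the open right-half plane, which holds for the assumed form because its only pole is at $s = -1/(R_{o1}C_{o1})$.

```python
def cell_admittance(s, gm1, gm2, Ro1, Ro2, Co1, Co2):
    """Assumed output admittance of the simplified two-amplifier cell."""
    return gm2 + 1.0 / Ro2 + s * Co2 + gm1 * gm2 * Ro1 / (1.0 + s * Ro1 * Co1)


def min_real_part(Y, omegas):
    """Smallest Re{Y(jw)} over the sampled angular frequencies."""
    return min(Y(1j * w).real for w in omegas)


# Hypothetical element values, chosen only to exercise the check.
PARAMS = dict(gm1=1e-5, gm2=1e-5, Ro1=1e8, Ro2=1e8, Co1=1e-13, Co2=1e-12)

# Logarithmic sweep from 1 rad/s to 1e9 rad/s, ten points per decade.
OMEGAS = [10.0 ** (k / 10.0) for k in range(0, 91)]

worst = min_real_part(lambda s: cell_admittance(s, **PARAMS), OMEGAS)
print("min Re{Y(jw)} on the sweep:", worst)
print("passes the sampled positive-real check:", worst > 0.0)
```

The same sweep could be fed with measured admittance samples instead of a formula, which is the sense in which the condition is experimentally verifiable.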


Figure 4: Simplified network topology for the circuit in
Fig. 1. The capacitor that appears explicitly in Fig. 1
has been absorbed into $C_{o2}$.


Of course this model is oversimplified, since the circuit
does oscillate. Transistor parasitics and layout parasitics
cause the output admittance of the individual active
cells to deviate from the form given in eq. (1), and
any  very  accurate  model  will  necessarily  be  quite  high 
order.  The  following  theorem  shows  how  far  one  can  re¬ 
lax  the  positive-real  condition  and  still  guarantee  that 
the  entire  network  is  robustly  stable. 

Terminology 

The terms open right-half plane and closed right-half
plane refer to the set of all complex numbers $s = \sigma + j\omega$
with $\sigma > 0$ and $\sigma \ge 0$, respectively, and the term closed
second quadrant refers to the set of complex numbers
with $\sigma \le 0$ and $\omega \ge 0$. A natural frequency of a linear
network is a complex frequency $s_0$ such that, when
all independent sources are set to zero and all branch
impedances and admittances are evaluated at $s_0$, there
exists a nonzero solution for the complex branch voltages
$\{V_k\}$ and currents $\{I_k\}$ [11]. A lumped linear network is
said  to  be  stable  if  a)  it  has  no  natural  frequencies  in 
the  closed  right-half  plane  except  perhaps  at  the  origin, 
and  b)  any  natural  frequency  at  the  origin  results  only 
in  network  solutions  that  are  constant  as  functions  of 
time.  (The  latter  condition  rules  out  unstable  transient 
solutions  that  grow  polynomially  in  time  resulting  from 
a  repeated  natural  frequency  at  the  origin.) 

Theorem  1 

Consider the class of linear networks of arbitrary topology,
consisting of any number of positive 2-terminal resistors
and capacitors and of $N$ lumped linear impedances
$Z_n(s)$, $n = 1, 2, \ldots, N$, that are open- and short-circuit
stable in isolation, i.e., that have no poles or zeroes in
the closed right-half plane. Every such network is stable
if at each frequency $\omega > 0$ there exists a phase angle
$\theta(\omega)$ such that $0 \ge \theta(\omega) \ge -90^\circ$ and
$|\angle Z_n(j\omega) - \theta(\omega)| < 90^\circ$, $n = 1, 2, \ldots, N$.

An  equivalent  statement  of  this  last  condition  is  that 
the Nyquist plot of each cell’s output impedance for $\omega > 0$
never intersects the closed 2nd quadrant, and that no
two  cells’  output  impedance  phase  angles  can  ever  differ 
by  as  much  as  180°.  If  all  the  active  cells  are  designed 
identically  and  fabricated  on  the  same  chip,  their  phase 
angles  should  track  fairly  closely  in  practice,  and  thus 
this  second  condition  is  a  natural  one. 
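
The phase condition of Theorem 1 can be checked directly from measured or simulated frequency-response samples. The sketch below is one way to do this, assuming each cell is described by a Python callable returning $Z_n(s)$; the helper names and the two example impedances are hypothetical. Reading the admissible range of $\theta$ as the closed interval from $-90^\circ$ to $0^\circ$, a suitable $\theta$ exists at a given frequency exactly when every cell phase lies strictly between $-180^\circ$ and $90^\circ$ and the spread between the largest and smallest phase is below $180^\circ$.

```python
import cmath
import math


def phase_condition_at(z_values):
    """Theorem 1 phase test at one frequency.

    z_values holds the complex cell impedances Z_n(jw) at a single w.
    Returns True when some theta in [-90, 0] degrees satisfies
    |angle(Z_n) - theta| < 90 degrees for every n.
    """
    angles = [math.degrees(cmath.phase(z)) for z in z_values]  # in [-180, 180]
    a_max, a_min = max(angles), min(angles)
    return a_max < 90.0 and a_min > -180.0 and (a_max - a_min) < 180.0


def phase_condition_on_sweep(cells, omegas):
    """Apply the single-frequency test at every sampled frequency."""
    return all(phase_condition_at([Z(1j * w) for Z in cells]) for w in omegas)


# Hypothetical cells: a passive RC impedance and a two-pole impedance whose
# Nyquist plot enters the third quadrant (so it is not positive-real).
def rc_cell(s):
    return 1.0 / (1.0 + s)


def two_pole_cell(s):
    return 1.0 / (1.0 + s) ** 2


# Logarithmic sweep from 1e-3 to 1e3 rad/s, ten points per decade.
OMEGAS = [10.0 ** (k / 10.0) for k in range(-30, 31)]
print(phase_condition_on_sweep([rc_cell, two_pole_cell], OMEGAS))
```

A sampled sweep can only falsify the condition between the chosen frequencies, so in practice the grid would be made dense near any frequency where a phase approaches $90^\circ$ or $-180^\circ$.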

The  theorem  is  intuitively  reasonable.  The  assump¬ 
tions  guarantee  that  the  cells  cannot  resonate  with  one 
another at any purely sinusoidal frequency $s = j\omega$
since their phase angles can never differ by as much
as 180°, and they can never resonate with the resistors
and capacitors since there is no $\omega > 0$ at which
both $\mathrm{Re}\{Z_n(j\omega)\} \le 0$ and $\mathrm{Im}\{Z_n(j\omega)\} \ge 0$ for some
$n$, $1 \le n \le N$. The proof formalizes this argument using
conservation of complex power, extends it to rule out
natural  frequencies  in  the  right-half  plane  as  well,  and 
shows  why  instabilities  resulting  from  a  repeated  natural 
frequency  at  the  origin  cannot  occur. 

Proof  of  Theorem  1 

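Let $s_0$ denote a natural frequency of the network and
$\{V_k\}$, $\{I_k\}$ denote any complex network solution at $s_0$.
By Tellegen's theorem [12], or conservation of complex
power, we have

$$\sum_k V_k I_k^* = 0, \qquad (2)$$

i.e., for $s_0 \ne$ any pole of $Z_n$, $n = 1, \ldots, N$, and $s_0 \ne 0$,

$$\sum_{\text{resistors}} |I_k|^2 R_k + \sum_{\text{capacitors}} |I_k|^2 (s_0 C_k)^{-1} + \sum_{\text{cells}} |I_n|^2 Z_n(s_0) = 0, \qquad (3)$$

and for $s_0 \ne$ any zero of $Z_n$, $n = 1, \ldots, N$,

$$\sum_{\text{resistors}} |V_k|^2 R_k^{-1} + \sum_{\text{capacitors}} |V_k|^2 s_0^* C_k + \sum_{\text{cells}} |V_n|^2 Y_n^*(s_0) = 0, \qquad (4)$$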

where  the  superscript  *  denotes  the  complex  conjugate 
operation.  The  proof  is  completed  in  the  following  three 
parts,  which  together  rule  out  the  existence  of  any  natu¬ 
ral  frequencies  in  the  closed  right-half  plane  (except  pos¬ 
sibly  for  a  single  one  at  the  origin). 

Part  i) 

This part shows that there are no natural frequencies at
$s_0 = j\omega \ne 0$. For each $\omega > 0$ all the cell impedance values
lie strictly below and to the right of a half-space boundary
passing through the origin of the complex plane at
an angle $\theta(\omega) + 90^\circ$ with the real positive axis, as shown
in Fig. 5. The capacitor impedances $\{(j\omega C_k)^{-1}\}$ and
the resistor impedances $\{R_k\}$ also lie below and to the
right of this line. Thus no positive linear combination
of these impedances can vanish as required by (3). A
similar argument holds for $\omega < 0$.

[Figure 5: image not recovered; the only surviving label reads "Closed Second".]

Part  ii) 

This part shows that there cannot exist a repeated natural
frequency at the origin that leads to a time-dependent
solution. The assumptions that the cell impedances
have no $j\omega$-axis zeroes and that their Nyquist plots for
$\omega > 0$ never intersect the closed 2nd quadrant imply
that $Y_n(0) > 0$, $n = 1, \ldots, N$. Thus (4) requires that
all the voltages across resistor branches and cell output
branches must vanish in any complex network solution at
$s_0 = 0$. Thus only capacitor voltages can be nonzero and
the network solution will be unaltered if all non-capacitor
branches are replaced by short circuits. But every solution
to a network comprised only of positive, linear
2-terminal capacitors is constant in time (and hence stable).

Part  iii) 

This part uses a homotopy argument to show that there
are no natural frequencies in the open right-half plane.
Assume the contrary, i.e., that there exists such a network
with a natural frequency $s_1$ with $\mathrm{Re}\{s_1\} > 0$. Alter
each element in the network (except resistors) as follows.
For each cell having a $Z_n(s)$ of relative degree less than
zero, add a series resistance $R$; for all other cells and for
capacitors, add a parallel conductance $G$ to each. Call
each resulting pair a “composite element”, and choose
$R = G = \lambda > 0$. For $\lambda$ sufficiently large all natural
frequencies must lie in the open left-half plane since every
branch element is strictly passive for $\lambda$ sufficiently large.
Since the natural frequencies are continuous functions of
$\lambda$ [13] and $\mathrm{Re}\{s_1\} > 0$ for $\lambda = 0$, there exists some $\lambda > 0$
for which some natural frequency $s'$ lies on the imaginary
axis. But this is ruled out by the proof in part i)
unless $s' = 0$, and the argument in part ii) rules out
$s' = 0$, since any network solution at $s' = 0$ consists of
zero branch voltages except for capacitor branches, and
for $\lambda > 0$ each capacitor has a positive conductance $G$
in parallel with it. Since the voltage across every $G$ is
zero in such a network solution, all branch voltages (and
thus all branch currents) in that solution must be zero,
which is a contradiction because a natural frequency at
$s'$ implies the existence of a nonzero solution.

III.  Stability  Result  for  Networks  with 
Nonlinear  Resistors  and  Capacitors 

The  previous  results  for  linear  networks  can  afford 
some  limited  insight  into  the  behavior  of  nonlinear  net¬ 
works.  If  a  linearized  model  is  stable,  then  the  equilib¬ 
rium  point  of  the  original  nonlinear  network  is  locally 
stable.  But  the  result  in  this  section,  in  contrast,  applies 
to  the  full  nonlinear  circuit  model  and  allows  one  to  con¬ 
clude  that  in  certain  circumstances  the  network  cannot 
oscillate  even  if  the  initial  state  is  arbitrarily  far  from  the 
equilibrium  point. 

Terminology 

We say that a function $y = f(x)$ lies in the sector $[a, b]$ if
$a x^2 \le x f(x) \le b x^2$. And we say that an impedance $Z(s)$
satisfies the Popov criterion if $(1 + \tau s) Z(s)$ is positive
real [7, 8, and 9, p. 174] for some $\tau > 0$. (Note that this
formulation of the Popov criterion differs slightly from
that given in standard references [8 and 9, p. 186].)
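
The frequency-domain part of the Popov criterion as formulated here can be explored numerically. The following is a minimal sketch, assuming a hypothetical two-pole cell impedance $Z(s) = 1/(1+s)^2$ and a hypothetical helper `popov_margin`; it samples $\mathrm{Re}\{(1 + j\omega\tau) Z(j\omega)\}$ on a frequency grid. A nonnegative result on the grid is a necessary condition only, since positive-realness also requires analyticity in the open right-half plane.

```python
def popov_margin(Z, tau, omegas):
    """Smallest Re{(1 + j*w*tau) * Z(j*w)} over the sampled frequencies."""
    return min(((1.0 + 1j * w * tau) * Z(1j * w)).real for w in omegas)


def two_pole_cell(s):
    """Hypothetical cell impedance; not positive-real on its own."""
    return 1.0 / (1.0 + s) ** 2


# Logarithmic sweep from 1e-3 to 1e3 rad/s, ten points per decade.
OMEGAS = [10.0 ** (k / 10.0) for k in range(-30, 31)]

# With tau = 0 the real part of Z(jw) goes negative above w = 1.
print(popov_margin(two_pole_cell, 0.0, OMEGAS) < 0.0)
# With tau = 1 the product (1 + s) Z(s) reduces to 1/(1 + s).
print(popov_margin(two_pole_cell, 1.0, OMEGAS) > 0.0)
```

For this example the real part works out to $(1 + \omega^2(2\tau - 1))/(1 + \omega^2)^2$, so any $\tau \ge 1/2$ passes the sampled test, which illustrates how the multiplier admits impedances that fail the plain positive-real condition.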

Theorem  2 

Consider a network consisting of possibly nonlinear
resistors and capacitors and cells with linear output
impedances $Z_n(s)$, $n = 1, 2, \ldots, N$. Suppose

i) the resistor curves are continuous functions $i_k = g_k(v_k)$
where $g_k$ lies in the sector $[0, G_{\max}]$, $G_{\max} > 0$, for
all resistors,

ii) the capacitors are characterized by continuous functions
$i_k = C_k(v_k)\,\dot{v}_k$ where $0 < C_k(v_k) < C_{\max}$, for all $k$
and $v_k$, and

iii) the impedances $Z_n(s)$ all satisfy the Popov criterion
for some common value of $\tau > 0$.

Then the network is stable in the sense that, for any
initial condition,

$$\int_0^\infty \Big[ \sum_{\substack{\text{all resistors} \\ \text{and capacitors}}} i_k^2(t) \Big]\, dt < \infty. \qquad (5)$$
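
Hypothesis i) of the theorem is a sector condition on each resistor curve, and it is easy to test on sampled data. The sketch below assumes normalized units and two hypothetical resistor characteristics (a saturating curve and a curve with a negative-resistance region); the function `in_sector` and the tolerance are illustrative choices rather than part of the theorem.

```python
import math


def in_sector(g, a, b, samples, tol=1e-12):
    """True if a*v**2 <= v*g(v) <= b*v**2 at every sampled v (within tol)."""
    return all(a * v * v - tol <= v * g(v) <= b * v * v + tol for v in samples)


# Hypothetical saturating resistor in normalized units: slope 1 at the origin.
def saturating_resistor(v):
    return math.tanh(v)


# Hypothetical curve with a negative-resistance region near the origin.
def negative_slope_resistor(v):
    return v ** 3 - v


# Voltage samples from -5 to 5 in steps of 0.01 (normalized units).
VOLTAGES = [-5.0 + 0.01 * k for k in range(1001)]
G_MAX = 1.0

print(in_sector(saturating_resistor, 0.0, G_MAX, VOLTAGES))
print(in_sector(negative_slope_resistor, 0.0, G_MAX, VOLTAGES))
```

The saturating curve stays inside the sector because its slope never exceeds its value at the origin, while the second curve violates the lower bound wherever $v\,g(v)$ is negative.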


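By Tellegen’s theorem, for any set of initial conditions
and any time $T > 0$,

$$\int_0^T \sum_{\text{resistors}} (v_k(t) + \tau \dot{v}_k(t))\, i_k(t)\, dt + \int_0^T \sum_{\text{capacitors}} (v_k(t) + \tau \dot{v}_k(t))\, i_k(t)\, dt + \int_0^T \sum_{\text{cell impedances}} (v_n(t) + \tau \dot{v}_n(t))\, i_n(t)\, dt = 0. \qquad (6)$$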


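For resistors, multiplying the sector inequality $v\,g(v) \le G_{\max} v^2$
by $i/v \ge 0$ (for $v \ne 0$) yields $i^2 = i\,g(v) \le G_{\max}\, v\, i$, and hence

$$G_{\max}^{-1} \int_0^T i_k^2(t)\, dt \le \int_0^T i_k(t)\, v_k(t)\, dt = \int_0^T i_k(t) [v_k(t) + \tau \dot{v}_k(t)]\, dt - \tau [\phi_k(v_k(T)) - \phi_k(v_k(0))], \qquad (7)$$

where

$$\phi_k(v) = \int_0^v g_k(v')\, dv' \ge 0 \qquad (8)$$

is the resistor co-content. Using the inequality (8) in (7)
yields for each resistor

$$G_{\max}^{-1} \int_0^T i_k^2(t)\, dt - \tau \phi_k(v_k(0)) \le \int_0^T i_k(t) [v_k(t) + \tau \dot{v}_k(t)]\, dt. \qquad (9)$$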

For capacitors, integrating the inequality
$i_k^2 = C_k^2(v_k)\,\dot{v}_k^2 \le C_{\max} C_k(v_k)\,\dot{v}_k^2$ yields

$$\frac{\tau}{C_{\max}} \int_0^T i_k^2(t)\, dt \le \tau \int_0^T C_k(v_k)\,\dot{v}_k^2(t)\, dt = \int_0^T i_k(t) [v_k(t) + \tau \dot{v}_k(t)]\, dt - [E_k(q_k(T)) - E_k(q_k(0))], \qquad (10)$$

where

$$E_k(q) = \int_0^q v_k(q')\, dq' \ge 0 \qquad (11)$$

is the capacitor energy. Using the inequality (11) in (10)
yields for each capacitor

$$\frac{\tau}{C_{\max}} \int_0^T i_k^2(t)\, dt - E_k(q_k(0)) \le \int_0^T i_k(t) [v_k(t) + \tau \dot{v}_k(t)]\, dt. \qquad (12)$$

And for the cells, the assumption that $(1 + \tau s) Z_n(s)$ is
positive real implies that

$$\int_0^T i_n(t) [v_n(t) + \tau \dot{v}_n(t)]\, dt \ge -E_n(0), \qquad (13)$$

where $E_n(0)$ is the initial “energy” in the mathematically
constructed impedance $(1 + \tau s) Z_n(s)$ at $t = 0$, a function
of the initial conditions only. Substituting (9), (12) and
(13) into (6) yields

$$G_{\max}^{-1} \int_0^T \sum_{\text{resistors}} i_k^2(t)\, dt + \frac{\tau}{C_{\max}} \int_0^T \sum_{\text{capacitors}} i_k^2(t)\, dt \le \tau \sum_{\text{resistors}} \phi_k(v_k(0)) + \sum_{\text{capacitors}} E_k(q_k(0)) + \sum_{\text{cells}} E_n(0), \qquad (14)$$

where the right hand side is a function only of the initial
conditions. Thus (5) holds. ∎

Note that Thm. 2, as it is stated, applies only to
networks in which the voltage source waveform of each
cell’s Thevenin equivalent circuit is identically zero. In
practice, these voltages are generally nonzero and change
with time. Yet a necessary condition for design is
that the circuit be stable for constant Thevenin voltages
(which would result from a constant light input). If this
condition is met, then the effect of time variation can be
thought of as an issue separate from stability and related
to the convergence rate of the network towards a “time-dependent
equilibrium point.” Thus, it is appropriate to
extend Thm. 2 to include the case of cells that have
arbitrary but constant Thevenin voltages. This can be
done simply by requiring the resistor curves to satisfy the
sector condition i) of the theorem about all possible
equilibrium points. Even if there is no known restriction on
the set of equilibrium points, the sector condition will be
satisfied at every equilibrium point if all the $g_k$’s are
nondecreasing differentiable functions with bounded slope.

IV. Concluding Remarks


The  design  criteria  presented  here  are  simple  and  prac¬ 
tical,  though  at  present  their  validity  is  restricted  to  lin¬ 
ear  models  of  the  cells.  There  are  several  areas  of  further 
work  to  be  pursued,  one  of  which  is  an  analysis  of  the  cell 
that  includes  amplifier  clipping  effects.  Others  include 
the  synthesis  of  a  compensator  for  the  cell,  an  extension 
of  the  nonlinear  result  to  include  impedance  multipliers 
other  than  the  Popov  operator,  a  bound  on  the  network 
settling time when the optical input is constant, and a
bound on the $L_2$ norm of the resistor and capacitor currents
in terms of the $L_2$ norm of the Thevenin equivalent
cell voltage waveforms when the optical input is time-varying.


13.  M.  Marden,  Geometry  of  Polynomials,  American 
Mathematical  Society,  Providence,  RI,  no.  3,  1985,  pp. 
4-5. 
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Abstract 

Designers  of  switching  filter  circuits  are  often  interested  in  steady-state  and 
intermodulation  distortion  due  to  both  static  effects,  such  as  nonlinearities  in  the 
capacitors,  and  dynamic  effects,  such  as  the  charge  injection  during  MOS  transistor 
switching  or  slow  operational  amplifier  settling.  Steady-state  distortion  can  be  computed 
using  the  circuit  simulation  program  SPICE,  but  this  approach  is  computationally  very 
expensive.  Specialized  programs  for  switched  capacitor  filters  can  be  used  to  rapidly 
compute  steady-state  distortion,  but  do  not  consider  dynamic  effects.  In  this  paper  we 
present  a  new  mixed  frequency-time  approach  for  computing  both  steady-state  and 
intermodulation  distortion.  The  method  is  both  computationally  efficient  and  includes 
both  static  and  dynamic  distortion  sources.  The  method  has  been  implemented  in  a 
C  program,  Nitswit,  and  results  from  several  examples  are  presented. 
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EVALUATING  THE  PERFORMANCE  OF  SOFTWARE  CACHE  COHERENCE 


Susan  Owicki  and  Anant  Agarwal 


Abstract 

In  a  shared-memory  multiprocessor  with  private  caches,  cached  copies  of  a  data  item  must 
be  kept  consistent.  This  is  called  cache  coherence.  Both  hardware  and  software  coherence 
schemes  have  been  proposed.  Software  techniques  are  attractive  because  they  avoid 
hardware  complexity  and  can  be  used  with  any  processor-memory  interconnection.  This 
paper  presents  an  analytical  model  of  the  performance  of  two  software  coherence  schemes 
and,  for  comparison,  snoopy-cache  hardware.  The  model  is  validated  against  address 
traces  from  a  bus-based  multiprocessor.  The  behavior  of  the  coherence  schemes  under 
various  workloads  is  compared,  and  their  sensitivity  to  variations  in  workload  parameters  is 
assessed.  The  analysis  shows  that  the  performance  of  software  schemes  is  critically 
determined  by  certain  parameters  of  the  workload:  the  proportion  of  data  accesses,  the 
fraction  of  shared  references,  and  the  number  of  times  a  shared  block  is  accessed  before  it 
is  purged  from  the  cache.  Snoopy  caches  are  more  resilient  to  variations  in  these 
parameters.  Thus  when  evaluating  a  software  scheme  as  a  design  alternative,  it  is  essential 
to  consider  the  characteristics  of  the  expected  workload.  The  performance  of  the  two 
software  schemes  with  a  multistage  interconnection  network  is  also  evaluated,  and  it  is 
determined  that  both  scale  well. 
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The  J-Machine:  System  Support  for  Actors 


William  J.  Dally 


Abstract 


The  J-Machine  in  concert  with  its  operating  system  kernel,  JOSS,  provides  low-overhead 
system  services  to  support  actor  programming  systems.  The  J-Machine  is  not  specialized  to 
actor  systems;  instead,  it  provides  primitive  mechanisms  for  communication,  synchron¬ 
ization,  and  translation.  Communication  mechanisms  are  provided  that  permit  a  node  to 
send a message to any other node in the machine in < 2 μs. On message arrival, a task is
created and dispatched in < 1 μs. A translation mechanism supports a global virtual
address  space.  These  mechanisms  efficiently  support  most  proposed  models  of  concurrent 
computation.  The  hardware  is  an  ensemble  of  up  to  65,536  nodes  each  containing  a  36-bit 
processor,  4K  36-bit  words  of  memory,  and  a  router.  The  nodes  are  connected  by  a  high¬ 
speed  3-D  mesh  network.  This  design  was  chosen  to  make  the  most  efficient  use  of 
available  chip  and  board  area. 
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Universal  Packet  Routing  Algorithms 
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Abstract 

In  this  paper  we  examine  the  packet  routing  problem  in  a  network  independent  context. 
Our  goal  is  to  devise  a  strategy  for  routing  that  works  well  for  a  wide  variety  of  networks. 
To  achieve  this  goal,  we  partition  the  routing  problem  into  two  stages:  a  path  selection 
stage  and  a  scheduling  stage. 

In the first stage we find paths for the packets with small maximum distance, d, and small
maximum congestion, c. Once the paths are fixed, both are lower bounds on the time
required to deliver the packets. In the second stage we find a schedule for the movement
of each packet along its path so that no two packets traverse the same edge at the same
time, and so that the total time and maximum queue size required to route all of the
packets to their destinations are minimized. For many graphs, the first stage is easy: we
simply use randomized intermediate destinations as suggested by Valiant. The second
stage is more challenging, however, and is the focus of this paper. Our results include:

1. a proof that there is a schedule of length O(c+d) requiring only constant size queues
for any set of paths with distance d and congestion c,

2. a randomized on-line algorithm for routing any set of N “leveled” paths on a
bounded-degree network in O(c + d + log N) steps using constant size queues,

3. the first on-line algorithm for routing N packets in the N-node shuffle-exchange graph
in O(log N) steps using constant size queues, and

4. the first constructions of area and volume-universal networks requiring only O(log N)
slow-down.
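
The two quantities that drive the bounds above, the maximum distance d and the maximum congestion c, are straightforward to compute once the paths are fixed. The sketch below is illustrative only: it assumes each path is given as a list of node labels, treats edges as directed, and measures distance in hops. The function name and the example paths are hypothetical.

```python
from collections import Counter


def congestion_and_distance(paths):
    """Return (c, d) for a list of paths, each a list of node labels.

    d is the largest number of edges on any path; c is the largest number
    of path traversals over any single directed edge.
    """
    edge_use = Counter()
    d = 0
    for path in paths:
        edges = list(zip(path, path[1:]))  # consecutive node pairs
        d = max(d, len(edges))
        edge_use.update(edges)
    c = max(edge_use.values(), default=0)
    return c, d


# Three hypothetical packet paths that all share the edge (1, 2).
PATHS = [[0, 1, 2, 3, 7], [4, 1, 2, 5], [6, 1, 2]]
c, d = congestion_and_distance(PATHS)
print(c, d)          # congestion and distance
print(max(c, d))     # a lower bound on the delivery time, in steps
```

Because each of c and d is a lower bound on the delivery time, a schedule of length O(c+d) is within a constant factor of the best possible for the chosen paths.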
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Criteria  for  Robust  Stability  in  a  Class  of  Lateral 
Inhibition  Networks  Coupled  Through  Resistive  Grids 

John  L.  Wyatt,  Jr.  and  David  L.  Standley 


Abstract 

In  the  analog  VLSI  implementation  of  neural  systems,  it  is  sometimes  convenient  to  build 
lateral  inhibition  networks  by  using  a  locally  connected  on-chip  resistive  grid  to 
interconnect  active  elements.  A  serious  problem  of  unwanted  spontaneous  oscillation 
often  arises  with  these  circuits  and  renders  them  unusable  in  practice.  This  paper  reports 
on  criteria  that  guarantee  these  and  certain  other  systems  will  be  stable,  even  though  the 
values  of  designed  elements  in  the  resistive  grid  may  be  imprecise  and  the  location  and 
values of parasitic elements may be unknown.  The method is based on a rigorous,
somewhat  novel  mathematical  analysis  using  Tellegen’s  theorem  from  electrical  circuits 
and  the  idea  of  a  Popov  multiplier  from  control  theory.  The  criteria  are  local  in  that  no 
overall  analysis  of  the  interconnected  system  is  required  for  their  use,  empirical  in  that  they 
involve  only  measurable  frequency  response  data  on  the  individual  cells,  and  robust  in  that 
they  are  insensitive  to  network  topology  and  to  unmodelled  parasitic  resistances  and 
capacitances  in  the  interconnect  network.  Certain  results  are  robust  in  the  additional  sense 
that  specified  nonlinear  elements  in  the  grid  do  not  affect  the  stability  criteria.  The  results 
are  designed  to  be  applicable,  with  further  development,  to  complex  and  incompletely 
modelled  living  neural  systems. 
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Stability  Criterion  for  Lateral  Inhibition  and  Related  Networks 
that  is  Robust  in  the  Presence  of  Integrated  Circuit  Parasitics 
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Abstract 

In  the  analog  VLSI  implementation  of  neural  systems,  it  is  sometimes  convenient  to  build 
lateral  inhibition  networks  by  using  a  locally  connected  on-chip  resistive  grid.  A  serious 
problem  of  unwanted  spontaneous  oscillation  often  arises  with  these  circuits  and  renders 
them  unusable  in  practice.  This  paper  reports  a  design  approach  that  guarantees  such  a 
system  will  be  stable,  even  though  the  values  of  designed  elements  in  the  resistive  grid  may 
be  imprecise  and  the  location  and  values  of  parasitic  elements  may  be  unknown.  The 
method  is  based  on  a  mathematical  analysis  using  Tellegen’s  theorem  and  the  Popov 
criterion.  The  criteria  are  local  in  the  sense  that  no  overall  analysis  of  the  interconnected 
system  is  required  for  their  use,  empirical  in  the  sense  that  they  involve  only  measurable 
frequency  response  data  on  the  individual  cells,  and  robust  in  the  sense  that  they  are  not 
affected  by  unmodelled  parasitic  resistances  and  capacitances  in  the  interconnect  network. 
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Abstract 

In this paper, we outline a synthesis procedure which,
beginning from a State Transition Graph description
of a sequential machine, produces an optimized, easily
testable PLA-based logic implementation.

Previous  approaches  to  synthesizing  easily  testable 
sequential  machines  have  concentrated  on  the  stuck-at 
fault  model.  For  PLAs,  an  extended  fault  model  called 
the  crosspoint  fault  model  is  used.  In  this  paper,  we 
propose  a  procedure  of  constrained  state  assignment 
and  logic  optimization  which  guarantees  testability 
for  all  combinationally  irredundant  crosspoint  faults  in  a 
PLA-based  finite  state  machine.  No  direct  access  to  the 
flip-flops  is  required.  The  test  sequences  to  detect  these 
faults  can  be  obtained  using  combinational  test  genera¬ 
tion  techniques  alone.  This  procedure  thus  represents  an 
alternative  to  a  Scan  Design  methodology.  We  present 
results  which  illustrate  the  efficacy  of  this  procedure  — 
the  area/performance  penalties  in  return  for  easy  testa¬ 
bility  are  small. 

1  Introduction 

Test  generation  for  sequential  circuits  has  long  been  rec¬ 
ognized  as  a  difficult  task  [4].  Several  approaches  [3]  [18] 
[15]  [14]  [17]  [19]  have  been  taken  in  the  past  to  solve 
the  problem  of  test  generation  for  sequential  circuits. 
They are either extensions to the classical D-Algorithm
or  based  on  random  techniques  [18]  [17].  When  the  num¬ 
ber  of  states  of  the  circuit  is  large  and  the  tests  demand 
long  input  sequences,  they  can  be  quite  ineffective  for 
test  generation. 

For  sequential  circuits,  design  for  testability  has  been 
a  synonym  for  the  use  of  full  Scan  Design  techniques, 
such  as  the  LSSD  approach  [10]  pioneered  by  IBM.  This 
method  converts  the  difficult  problem  of  testing  sequen¬ 
tial  circuits,  into  a  much  easier  one,  that  of  testing  a 
combinational  circuit.  However,  there  are  cases  where 


the  area  and  timing  penalty  associated  with  LSSD  tech¬ 
niques  are  not  acceptable  to  designers. 

Logic  synthesis  and  minimization  techniques  can,  in 
principle,  ensure  fully  and  easily  testable  combinational 
and sequential circuit designs. In [1], a synthesis procedure
which guaranteed fully testable irredundant combinational
logic circuits was proposed. In [8], a procedure
which produced fully and easily testable logic-level
sequential machines from State Transition Graph descriptions
was proposed. The work in [8] showed that state
assignment  has  a  profound  effect  on  the  testability  of  a 
sequential  machine.  Recently,  an  optimal  synthesis  pro¬ 
cedure,  that  guarantees  full  non-scan  testability  under 
the  stuck-at  fault  model,  with  no  associated  area  or  per¬ 
formance  penalty  has  been  proposed  [6]. 

Programmable Logic Arrays (PLAs) are used extensively
in the design of complex VLSI systems. Sequential
functions can be realized very efficiently by adding
feedback registers to the PLA. Numerous programs for
the optimal synthesis of PLA-based finite state machines
have been developed (e.g. [16], [7]). Test generation and
design-for-testability techniques for PLA structures have
been active areas of research.

Due to a PLA’s dense layout, PLA faults other than
conventional stuck-at faults can occur easily and must
be modeled. An extended model, the crosspoint fault
model, has been proposed in [5] and [12]. The
crosspoint-oriented test set covers many of the frequently occurring
physical faults, including shorts between lines. Several
PLA test generation techniques aimed at the crosspoint
fault model have been proposed (e.g. [13], [9]). In particular,
an exact and efficient technique which guarantees
maximum fault coverage and identification of all redundant
faults was proposed in [20].

Design-for-testability techniques (e.g. [11]) for PLAs
require controllability of all inputs and observability of
all outputs of the PLA. Synthesis approaches to producing
easily testable sequential machines, without requiring
direct access to the inputs/outputs of the circuit’s memory
elements, have not been aimed at the crosspoint fault
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Abstract 

The  performance  of  cache-coherent  multiprocessors  is  strongly  influenced  by  locality  in 
the  memory  reference  behavior  of  parallel  applications.  While  the  notions  of  temporal  and 
spatial  locality  in  uniprocessor  memory  references  are  well  understood,  the  corresponding 
notions  of  locality  in  multiprocessors  and  their  impact  on  multiprocessor  cache  behavior  are 
not  clear.  A  locality  model  suitable  for  multiprocessor  cache  evaluation  is  derived  by  viewing 
memory  references  as  streams  of  processor  identifiers  directed  at  specific  cache/memory 
blocks.  This  viewpoint  differs  from  the  traditional  uniprocessor  approach  that  uses  streams 
of  addresses  to  different  blocks  emanating  from  specific  processors.  Our  view  is  based  on  the 
intuition  that  cache  coherence  traffic  in  multiprocessors  is  largely  determined  by  the  number 
of  processors  accessing  a  location,  the  frequency  with  which  they  access  the  location,  and  the 
sequence  in  which  their  accesses  occur.  The  specific  locations  accessed  by  each  processor, 
the  time  order  of  access  to  different  locations,  and  the  size  of  the  working  set  play  a  smaller 
role  in  determining  the  cache  coherence  traffic,  although  they  still  influence  intrinsic  cache 
performance.  Looking  at  traces  from  the  viewpoint  of  a  memory  block  leads  to  a  new  notion 
of  reference  locality  for  multiprocessors,  called  processor  locality.  In  this  paper,  we  study  the 
temporal,  spatial,  and  processor  locality  in  the  memory  reference  patterns  of  three  parallel 
applications.  Based  on  the  observed  locality,  we  then  reflect  on  the  expected  cache  behavior 
of  the  three  applications. 
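
As an illustration of the block-centered view described above, the following sketch regroups a global trace of references into one stream of processor identifiers per block, then measures the length of each run of consecutive accesses by a single processor. The trace format, the function names, and the run-length statistic are assumptions made for illustration only; the abstract does not give the paper's actual locality metrics.

```python
from collections import defaultdict
from itertools import groupby

def per_block_streams(trace):
    """Regroup a global trace of (processor_id, block_id) references
    into one stream of processor ids per block, preserving time order."""
    streams = defaultdict(list)
    for proc, block in trace:
        streams[block].append(proc)
    return dict(streams)

def run_lengths(stream):
    """Lengths of maximal runs of consecutive references by one processor."""
    return [len(list(group)) for _, group in groupby(stream)]

# Hypothetical trace: (processor, block) pairs in time order.
trace = [(0, "A"), (0, "A"), (1, "B"), (1, "A"), (0, "B"), (1, "A"), (1, "A")]
streams = per_block_streams(trace)
runs = {block: run_lengths(s) for block, s in streams.items()}
```

Under this reading, a block whose stream consists of a few long runs changes hands rarely, while a block whose stream alternates between processors on every reference changes hands constantly. This is one plausible way to make the intuition about access sequence concrete.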


1  Introduction 

Multiprocessors often use caches to reduce their network bandwidth requirements. Caches retain
recently accessed data so that repeat references to this data in the near future will not
require network traversals. Repeated access to the same data in a given interval of time is the
property of temporal locality of memory references and has been well studied in single processor
systems [1, 2]. Spatial locality of memory references is a related property: it implies a
high probability of access to data close to previously accessed data.
Again,  this  property  of  single  processor  programs  has  been  widely  observed.  The  viability  of 
cache-coherent  multiprocessors  is  strongly  predicated  on  whether  the  multiprocessor  caches  can 
exploit  locality  of  memory  referencing. 

Preliminary results of this study were reported in Sigmetrics 1988.
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Abstract 

In this paper, we present approaches to multi-level
sequential logic synthesis: algorithms and techniques for
the area and performance optimization of interconnected
finite state machine descriptions.

Interacting finite state machines are common in
industrial chip designs. While optimization techniques
for single finite state machines are relatively well
developed, the problem of optimization across latch
boundaries has received much less attention. Techniques to
optimize pipelined combinational logic so as to
improve area/throughput have been proposed. However,
logic cannot be straightforwardly migrated across latch
boundaries when the basic blocks are sequential rather
than combinational circuits.

We present new techniques for the exploitation of
sequential don’t cares in arbitrary, interconnected sequential
machine structures. Exploiting these don’t care
sequences can result in significant improvements in area
and performance. We address the problem of migrating
logic across state machine boundaries so as to make
particular machines less complex at the possible expense of
making others more complex. This can be useful from
both an area and performance point of view. We present
new optimization algorithms that incrementally modify
state machine structures across latch boundaries. We
discuss the use of more global state machine decomposition
and factorization algorithms for area optimization.
Finally, we present experimental results using these
algorithms on sequential circuits.

1  Introduction 

Interacting finite state machines (FSMs) are common in
chips being designed today. The advantages of a
hierarchical, distributed-style specification and realization
are many. While the terminal behavior of any set of
interconnected sequential circuits can be modeled and/or
realized by a lumped circuit, the former can be considerably
more compact, as well as being easy to understand
and manipulate.

The disadvantage of this form of specification from
a CAD point of view is that sequential logic synthesis
algorithms are generally restricted to operate on lumped
circuits. State assignment algorithms (e.g. [1], [8], [3]),
for instance, almost exclusively operate on single finite
state machines. Given a set of interacting machines
represented by State Transition Graphs, algorithms
that encode the internal states of the machines, taking
into account their interactions, do not exist to date. If
the machines are indeed encoded separately, disregarding
their interconnectivity, a sub-optimal state assignment
can result (and generally does).

Traditionally, the decomposition of an initial circuit
specification into smaller, interacting sequential circuits
has been performed by the logic designer. Once a
decomposition has been performed, it is almost never
changed and logic synthesis tools operate on separate
logic blocks independently. Unfortunately, there are no
guarantees regarding the quality of the initial
decomposition, in terms of minimality of communication
between the machines and/or complexities of the individual
machines. There exist automatic techniques that
can decompose lumped sequential circuits into smaller,
interacting ones (e.g. [5]). These techniques are limited
in the topology of interconnections that can be achieved
and severely limited in their capabilities of handling
circuits of large size. Flattening the initial, distributed
specification can result in a very large lumped circuit.

Efficient and flexible algorithms for re-partitioning
interacting sequential circuits for area and performance
optimization have not been proposed in the past. Work
has been done in re-partitioning pipelined combinational
logic stages (e.g. [6]). There is no restriction
on migrating logic across latch boundaries when the basic
blocks are combinational, provided the latches are
not observable: the functionality of the circuit is
unchanged by moving, say, one gate from before to after
a latch. However, when sequential circuits are
interconnected, as shown in Figure 1, one cannot arbitrarily
move logic across pipeline latch boundaries. (We refer to
flip-flops that store state as state latches and flip-flops
that store intermediate values as pipeline latches.) The
functionality and terminal behavior of the circuit will
be changed, even though the latches are not observable.
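
The combinational case above can be checked with a small simulation. The sketch below is an illustrative assumption, not taken from the paper: a pipeline latch is modeled as a one-cycle delay with initial value 0, and an AND gate placed before a single latch is compared with the same gate placed after a latch on each of its inputs. It covers only the purely combinational case; it does not model the interconnected sequential machines of Figure 1.

```python
def delay(stream, init=0):
    """Model a pipeline latch as a one-cycle delay with a given initial value."""
    return [init] + list(stream[:-1])

def gate_before_latch(a, b):
    # The AND gate feeds a single pipeline latch.
    return delay([x & y for x, y in zip(a, b)])

def gate_after_latch(a, b):
    # Each input is latched first; the AND gate follows the latches.
    return [x & y for x, y in zip(delay(a), delay(b))]

# Hypothetical input bit streams, one value per clock cycle.
a = [1, 0, 1, 1, 0, 1]
b = [1, 1, 0, 1, 1, 1]
before = gate_before_latch(a, b)
after = gate_after_latch(a, b)
```

Both arrangements produce the same output stream. The gate moved across the latch boundary costs an extra latch here, since both inputs must now be stored, but the terminal behavior is unchanged.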

One wishes to be able to migrate logic across pipeline
latch boundaries for several reasons. The duration of
the system clock has to be greater than the longest
path between any two pipeline stages. If a machine,
A, is significantly more complex than another machine
B, the critical path/system clock may be unnecessarily
long. The clock cycle could be shortened by making A
less complex at the possible expense of making B more
complex. In the best case, the complexities of both A
and B would decrease.

Another very important issue is the specification and
exploitation of don’t cares in interconnected FSM
descriptions. For example, in Figure 1, certain binary
combinations may never appear at the set of latches
L1. This will correspond to an incompletely specified
machine B. These don’t cares can be exploited using
standard state minimization strategies [9]. A more
complicated form of don’t care, referred to here as a se-
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An  area-universal  network  is  one  which  can  efficiently  simulate  any  other  network  of  comparable  area. 
This  paper  extends  previous  results  on  area-universal  networks  in  several  ways.  First,  it  considers  the 
size  (amount  of  attached  memory)  of  processors  comprising  the  networks  being  compared.  It  shows  that 
an appropriate universal network of area O(A) built from processors of size lg A requires only O(lg^2 A)
slowdown in bit-times to simulate any network of area A, without any restriction on processor size or
number  of  processors  in  the  competing  network.  Furthermore,  the  universal  network  can  be  designed 
so  that  any  message  traversing  a  path  of  length  d  in  the  competing  network  need  follow  a  path  of  only 
O(d + lg A) length in the universal network. Thus, the results are almost entirely insensitive to removal
of  the  unit  wire  delay  assumption  used  in  previous  work.  This  paper  also  derives  upper  bounds  on  the 
slowdown  required  by  a  universal  network  to  simulate  a  network  of  larger  area  and  shows  that  all  of  the 
simulation  results  are  valid  even  without  the  usual  assumption  that  computation  and  communication  of 
the  competing  network  proceed  in  separate  phases. 


1  Introduction 

This  paper  provides  several  advances  in  the  search  for  the  best  way  to  make  use  of  a  fixed  amount  of 
physical space when building a general-purpose parallel computer. The focus on space consumed by both
processors and interconnect is an attempt to measure real-world costs more faithfully than a mere count of
the number of processors would. The results of this paper are stated in terms of area required under standard
two-dimensional VLSI modeling assumptions. The extension to three dimensions is fairly straightforward
using  the  ideas  in  [2]. 

The  notion  of  a  routing  network  which  is  universal  for  a  given  amount  of  physical  space  was  introduced 
by Leiserson in [5]. That paper introduces a class of routing networks referred to as fat-trees and shows
that an appropriate n-processor network from this class can simulate (off-line) any other routing network
connecting the same processors and occupying the same volume, with only O(lg^3 n) factor degradation in
the time required. A slight restriction on the number of processors in the competing network was required,
because Leiserson's fat-trees used area slightly more than linear in the number of processors.

The  approach  to  proving  fat-trees  universal  is  twofold.  First  it  is  shown  that  any  competing  network 
can  be  mapped  to  a  fat-tree  without  placing  too  great  a  communications  load  on  any  of  the  communication 

This research was supported in part by the Defense Advanced Research Projects Agency under Contract
N00014-87-K-0825, by the Office of Naval Research under Contract N00014-86-K-0593, and by a Fannie and
John Hertz Foundation Fellowship.
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Abstract 

This thesis presents the formal background for a mathematical model for
level-clocked circuitry, in which latches are controlled by the levels (high or low) of clock
signals rather than transitions (edges) of the clocks. Such level-clocked circuits are
frequently used in MOS VLSI design. Our model maps continuous data domains,
such as voltage, into discrete, or digital, data domains, while retaining a continuous
notion  of  time.  A  level-clocked  circuit  is  represented  as  a  graph  G  =  (V,E),  where  V 
consists  of  digital  components — latches  and  functional  elements — and  E  represents 
inter-component  connections. 

The  majority  of  this  thesis  concentrates  on  developing  lemmas  and  theorems  that 
can  serve  as  a  set  of  “axioms”  when  analyzing  algorithms  based  on  the  model.  Key 
axioms  include  the  fact  that  circuits  in  our  model  generate  only  well  defined  digital 
signals,  and  the  fact  that  components  in  our  model  support  and  accurately  handle  the 
“undefined”  values  that  electrical  signals  must  take  on  when  they  make  a  transition 
between  valid  logic  levels.  In  order  to  facilitate  proofs  for  circuit  properties,  the  class 
of  computational  predicates  is  defined.  A  circuit  property  can  be  proved  by  simply 
casting  the  property  as  a  computational  predicate. 
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Abstract 

Achieving high rates of floating-point computation is one of the primary goals of many computer
designs. Many high speed floating-point datapaths have been designed in order to address this
problem. However, conventional designs often neglect the real problem in achieving high performance
floating-point: providing the necessary I/O bandwidth to keep the high speed datapaths
busy.

The Reconfigurable Arithmetic Processor (RAP) is an arithmetic processing node for a message-passing,
MIMD concurrent computer. Its datapath is designed to sustain high rates of floating-point
operations, while requiring only a fraction of the I/O bandwidth required by a conventional
floating-point datapath. The RAP incorporates on one chip eight 4-bit serial, 64-bit floating-point arithmetic
units connected by a switching network. By sequencing the switch through different patterns, the
RAP chip calculates complete arithmetic formulas. By chaining together its arithmetic units the
RAP eliminates the I/O bandwidth associated with storing and retrieving intermediate results, and
reduces the amount of off-chip data transfer.
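
To see why chaining reduces I/O, consider a toy word-counting model. This is a hypothetical sketch, not the RAP's actual accounting: it assumes an unchained arithmetic unit fetches two operands and stores one result per operation, while a chained datapath transfers only the formula's inputs and its final result. The expression encoding and function names are illustrative assumptions.

```python
def count_ops(expr):
    """Number of binary operations in a nested-tuple expression tree."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + count_ops(left) + count_ops(right)

def count_leaves(expr):
    """Number of input operands (leaves) in the expression tree."""
    if isinstance(expr, str):
        return 1
    _, left, right = expr
    return count_leaves(left) + count_leaves(right)

def unchained_words(expr):
    # Assumed model: every operation fetches two operands and stores one result.
    return 3 * count_ops(expr)

def chained_words(expr):
    # Assumed model: only the inputs and the final result cross the chip boundary.
    return count_leaves(expr) + 1

# a*b + c*d as a nested tuple: (operator, left, right).
expr = ("+", ("*", "a", "b"), ("*", "c", "d"))
```

For a*b + c*d this model gives 9 words transferred without chaining and 5 with it, because the two intermediate products never leave the datapath. The figures illustrate the mechanism only and are not comparable to the measured results reported for the RAP.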

This thesis describes and evaluates the RAP architecture. It presents two important aspects of the
chip  design:  the  control  logic  design,  and  the  schematic  level  design  of  the  RAP  datapath.  The  RAP 
datapath  design  includes  the  design  of  two  4-bit  serial  floating-point  units:  an  adder/subtractor 
unit  and  a  multiplier  unit.  In  order  to  use  the  RAP  datapath,  a  compiler  is  developed  that  takes 
as  input  a  list  of  mathematical  expressions,  and  outputs  a  series  of  switch  configurations  to  be  used 
by  the  RAP  to  do  the  calculation. 

On 23 benchmark problems, the RAP reduced both the on-chip and off-chip bandwidth requirements
by an average of 64% when compared to the bandwidth required by a conventional arithmetic chip
that does not exploit locality. Average floating-point performance is 3.40 million floating-point
operations per second (MFlops).
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